
    Increasing Performance of a High-Resolution Monitor Wall

    Jason Monk

Advisor: Prof. Bruce Segee

    Department of Electrical and Computer Engineering

University of Maine, Orono, USA

    [email protected]

    Abstract

This paper examines several aspects of rendering OpenGL graphics on large displays using VirtualGL and VNC. It looks at increasing performance in a couple of aspects of rendering on a monitor wall, in particular a CUDA-enhanced version of VirtualGL as well as the advantages of having multiple VNC servers. It discusses restrictions caused by read back and blitting rates and how they are affected by the size of the image being rendered. A CUDA extension for VirtualGL was developed allowing for faster read back at high resolutions.

I. INTRODUCTION

As we move into the digital age, large displays are becoming more and more popular. The most common way to achieve this is with a projector. A projector is a useful tool; however, as the image gets larger each pixel gets bigger, and the image looks worse and worse. A less common alternative to a projector is a monitor wall: several monitors set up in a grid. This solution allows large images while maintaining a high level of detail, because adding more monitors does not increase the pixel size of the existing monitors the way enlarging a projected image does.

There are several ways to achieve high-resolution data visualization, at both the hardware and the software layer. Although most high-resolution systems involve this monitor grid, there are a few ways to distribute the image to each display. Some create the image on one computer and distribute it to client machines afterwards, whereas others have each machine create its own piece of the image.

The four-by-four monitor wall located at the UMaine Innovation Center uses a system where one computer creates the image and then distributes it. To do this the computers run two software packages: xtightvncserver, to create a large display and distribute it, and VirtualGL, to do the rendering for applications run in the VNC server. Currently the setup can achieve frame rates of one to two frames per second while rendering close to full screen (about twenty megapixels).

When nVidia started producing the G8X series of cards they introduced an architecture called CUDA, and most of their video cards since then have had CUDA support. With this new architecture they provided extensions for C/C++ that create an Application Programming Interface (API) allowing code to be executed on the GPU. Since then the concept of GPGPU (general-purpose computing on graphics processing units) has been growing: the GPU is very good at algebra and at running things in parallel, so that power should be put to use for other applications.

    Fig. 1. Display Wall at Innovation Center

II. PROJECT GOAL

The goal of this project is to increase the performance of the display wall at the Innovation Center, preferably by harnessing, through CUDA, unused processing power available in the GPUs of the computers hosting the wall.

III. READ BACK RATE TESTING

At the start of the project it was believed that the factor limiting the frame rate of the display was read back from the video card (the process of getting each frame from the video card to the CPU for distribution). There were no reliable read-back rates available for these cards, so the rates needed to be tested.

A program was written (Appendix A) that uses CUDA to write and read large, random amounts of data to and from the video card memory and then processes the times collected; the program was later expanded to test the several types of memory accessible through the function cudaMemcpy. The data collected showed that write speeds were usually between 1700 and 2000 MB/s and read speeds were only slightly slower, between 1500 and 1800 MB/s; the results are shown in figure 2.


When writing to the card, the cudaMemcpy function appeared to have an overhead of about 1.2 ms, whereas when reading from the card the overhead was only 37 µs. Using page-locked memory can significantly increase both read and write speeds, to over 3 GB/s.
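The core of each measurement is a timed cudaMemcpy between a host buffer and device memory. The sketch below shows only the page-locked, host-to-device case; the 100 MB buffer size and device index are arbitrary choices for illustration, and the full program covering all of the memory types appears in Appendix A.

    #include <stdio.h>
    #include <time.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        const size_t bytes = 100 * 1024 * 1024;      /* 100 MB test buffer */
        void *host, *dev;
        struct timespec start, stop;
        double ns;

        cudaSetDevice(0);
        cudaMallocHost(&host, bytes);                /* page-locked host memory */
        cudaMalloc(&dev, bytes);

        clock_gettime(CLOCK_MONOTONIC, &start);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        clock_gettime(CLOCK_MONOTONIC, &stop);

        ns = (stop.tv_sec - start.tv_sec) * 1e9 + (stop.tv_nsec - start.tv_nsec);
        printf("Write speed: %.1f MB/s\n", (bytes / 1024.0 / 1024.0) / (ns / 1e9));

        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }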

    Fig. 2. Read and Write Test on GTX260

What had not been taken into account was that while read back was occurring on the wall, rendering may or may not also be taking place. So a new set of tests was run while something requiring rendering was also running on the video card; glxgears was used for this because it was already set up on the testing machine and was easy to use for automated testing.

    Fig. 3. Read Speeds while Rendering

Read and write speeds were tested while rendering square glxgears windows with widths from 50 up to 1150 pixels. The results showed that the available read and write bandwidth decreased linearly with the number of pixels being rendered. Figure 3 is a graph of the read speeds while rendering various square resolutions; the plot clearly follows a linear trend. Notice that there is a very large drop from the CUDA testing program being run by itself to any rendering happening at all: there is a drop of several hundred MB/s in bandwidth as soon as GLX rendering is occurring.

Using a linear regression we can extrapolate that there would be no bandwidth left while rendering at sizes upwards of 12 MP, let alone 20 MP. This shows that read back during rendering is undesirable and that better frame rates can only be achieved if rendering and reading are performed sequentially, one after another. The question became what exactly VirtualGL was doing, so the next step was examining the VirtualGL source code.

    IV. VIRTUALGL

VirtualGL is software that intercepts OpenGL commands from any program and runs them on the video card in an off-screen buffer called a PBuffer. Once each frame is rendered, VirtualGL reads the frame from the PBuffer back into host memory (memory available to the CPU), where it is available to other software such as a VNC server. VirtualGL is written mostly in C++ with a few pieces in C. Already written into VirtualGL is a profiler that tracks the frame rate of VGL; the profiler times the read back, the blit (writing to the X server), and the total frame rate.

VirtualGL and a VNC server were compiled and installed on a separate machine so that VirtualGL speeds could be tested and code changes could be run in a test environment instead of on the working system. Running the profiler on the testing machine (at low resolutions) confirmed that the issue was with read back, so the read back speed needed to be increased. VirtualGL reads frames with an OpenGL function named glReadPixels, and it is widely acknowledged that glReadPixels does not perform nearly as fast as it should. Performance testing showed that glReadPixels was not transferring at the full speed found earlier with CUDA, which means that if the transfer could be performed using CUDA rather than the GL call it could run faster.

nVidia has been working on a set of CUDA functions that allow interoperability between CUDA and OpenGL. Currently they allow an OpenGL Pixel Buffer Object (PBO) to be mapped to a pointer available to CUDA calls. The problem with using CUDA to read the frame is that the frame first has to be moved from the PBuffer it was rendered into to a PBO, and the only way to move the pixels from the PBuffer to a PBO is the OpenGL function glReadPixels. This glReadPixels call need not be as slow as the original, since the transfer stays within GPU memory, which has high bandwidth (greater than 10 GB/s). There is a similar example in the nVidia CUDA SDK: it renders, calls glReadPixels to move the pixels into a PBO, and then maps the PBO to a pointer readable by CUDA calls. In the nVidia example glReadPixels copies at a speed of 10-40 GB/s, as opposed to copying to host memory, which runs at less than 1 GB/s.


The example nVidia provides in the CUDA SDK requests a different pixel format than the other programs used for testing VirtualGL: it requests not only RGB pixels but an alpha channel as well. The function controlling these pixel formats is glXChooseVisual, which takes a list of attributes for the window and produces an XVisual structure to be used. Since VirtualGL already intercepts this call, only a simple change is required to replace any requested attributes with ones that increase performance. The performance increase occurs because glReadPixels takes a parameter specifying what format to read the pixels into; when the format requested and the format of the buffer do not match, a conversion must be done, creating a lot of overhead.
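As an illustration of the kind of attribute list involved (not the exact one used inside VirtualGL), a visual can be requested with an alpha channel so that the buffer format matches the GL_RGBA format later passed to glReadPixels:

    #include <GL/glx.h>

    /* Request a 32-bit RGBA visual so that the buffer format matches the
       GL_RGBA / GL_UNSIGNED_BYTE format passed to glReadPixels, avoiding a
       per-frame pixel-format conversion. */
    static int rgba_attribs[] = {
        GLX_RGBA,
        GLX_RED_SIZE,   8,
        GLX_GREEN_SIZE, 8,
        GLX_BLUE_SIZE,  8,
        GLX_ALPHA_SIZE, 8,
        GLX_DOUBLEBUFFER,
        None
    };

    XVisualInfo *choose_rgba_visual(Display *dpy, int screen)
    {
        return glXChooseVisual(dpy, screen, rgba_attribs);
    }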

This process of mapping and un-mapping a PBO each read cycle creates a lot of overhead. When dealing with a small window of only a few megapixels, the overhead is so great that no read back increase can be found. However, at larger resolutions, specifically 20 MP, an increase in read back rate can be found. When testing GoogleEarth at a full-screen resolution of 20 MP, an increase in read back of 30% was found: from the original 120 MP/s in the testing environment to a speed around 160 MP/s. To achieve this speedup, several things must be done to set up the process whenever a PBuffer is created.

1) Before any CUDA GL calls can be made, cudaGLSetGLDevice must be called; it is also good practice to call cudaSetDevice to make sure of which device is being used.

2) The PBO must be created and filled with data; this can be done using glGenBuffers, glBindBuffer, and glBufferData.

3) The PBO must be registered with CUDA using cudaGLRegisterBufferObject; upon exiting, cudaGLUnregisterBufferObject should be called.

Each time a frame is read, several steps are also required; a simplified code sketch of this read path is given after the list.

1) The PBO must be bound to the pack buffer and filled using glReadPixels.

2) The PBO must be unbound from the pack buffer and mapped using cudaGLMapBufferObject.

3) The data must be read from the PBO using cudaMemcpy.

4) The PBO must be unmapped using cudaGLUnmapBufferObject.
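The following is a minimal sketch of this setup and read path using the CUDA/OpenGL interoperability API named above; the buffer handle, frame dimensions, host destination, and RGBA pixel format are placeholders, and error checking is omitted. It is a simplified illustration, not the VirtualGL patch itself (which is in Appendix B).

    #include <GL/glew.h>
    #include <cuda_runtime.h>
    #include <cuda_gl_interop.h>

    /* One-time setup once the PBuffer (w x h, RGBA) exists. */
    GLuint create_readback_pbo(int w, int h)
    {
        GLuint pbo;

        cudaSetDevice(0);
        cudaGLSetGLDevice(0);                        /* setup step 1: bind CUDA to the GL device */

        glGenBuffers(1, &pbo);                       /* setup step 2: create and size the PBO    */
        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo);
        glBufferData(GL_PIXEL_PACK_BUFFER_ARB, w * h * 4, NULL, GL_STREAM_READ);
        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

        cudaGLRegisterBufferObject(pbo);             /* setup step 3: register the PBO with CUDA */
        return pbo;
    }

    /* Per-frame read back of the current read buffer into host memory. */
    void read_frame(GLuint pbo, int w, int h, void *host_bits)
    {
        void *devptr = NULL;

        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo);                      /* 1) pack into the PBO  */
        glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, NULL);        /*    (GPU-to-GPU copy)  */
        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);                        /* 2) unbind, then map   */

        cudaGLMapBufferObject(&devptr, pbo);
        cudaMemcpy(host_bits, devptr, w * h * 4, cudaMemcpyDeviceToHost); /* 3) copy to host       */
        cudaGLUnmapBufferObject(pbo);                                     /* 4) unmap for OpenGL   */
    }

When the PBuffer is destroyed, cudaGLUnregisterBufferObject and glDeleteBuffers release the PBO again.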

The entire process from the user's program to the X server is shown in figure 4. Although this process is more complex, it is faster because CUDA is better at utilizing the bandwidth between the GPU and the CPU. This section of code for VirtualGL can be found in Appendix B.

Once the frame was already being transferred through CUDA, an attempt was made at transferring and blitting only the changes required. A simple algorithm was used: if a change was found within a group of 512 pixels, those 512 pixels would be updated. The overhead associated with checking for changes and transferring small groups of pixels turned out to be too large to make this approach worth using, but it shows promise for solving the problem associated with blitting large frames to the X server. A sketch of such a change check is given below.
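As an illustration of the idea (not the exact code that was tested), a small CUDA kernel can compare the new frame against the previous one in 512-pixel groups and flag the groups that need to be transferred; the 32-bit RGBA pixel layout and the kernel name are assumptions.

    #include <cuda_runtime.h>

    #define GROUP_PIXELS 512

    /* Flag each 512-pixel group whose contents differ from the previous frame.
       Pixels are assumed to be 32-bit RGBA, one unsigned int per pixel. */
    __global__ void find_dirty_groups(const unsigned int *cur,
                                      const unsigned int *prev,
                                      int npixels, int *dirty)
    {
        int group = blockIdx.x;                     /* one CUDA block per pixel group */
        int start = group * GROUP_PIXELS;

        for (int i = start + threadIdx.x;
             i < start + GROUP_PIXELS && i < npixels;
             i += blockDim.x) {
            if (cur[i] != prev[i]) {
                dirty[group] = 1;                   /* mark the whole group for update */
                break;
            }
        }
    }

Only the groups marked dirty would then be copied to the host and blitted, which is where the per-group overhead described above comes from.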

    Fig. 4. Flow for CUDA Enhanced Read back

    V. VNC

When testing the performance of code changes in VirtualGL, a significant difference was found between the display wall at the Innovation Center and the test environment: the test environment was limited by the time it takes to read back, whereas the display wall, at a much higher resolution, is limited by the blit speed. In the blitting process VirtualGL makes a series of X calls to draw to the window and then calls XSync, which is what takes most of the time, so the current speed did not appear to be due to VirtualGL not being fast enough.

At that time the VNC server was also shown to be taking 100% of the time of one of the CPUs, and since the VNC server was not multi-threaded it could not take any more time than it was. According to a profiling of the VNC server done in the past, compression was taking the most time. Compression had been enabled because without it the client machines were often crashing. Attempting raw mode again was the next step to reduce CPU usage. By changing the read and write network buffers, raw mode was able to run for several minutes. In those several minutes it performed at about the same frame rate as before; however, the VNC server was taking much less CPU time and the network usage had gone up significantly (probably saturating during image transfer).

Since these two transfer types had the same performance, there was likely a balance between compression and network usage that would give better performance. After a few tests, better performance was achieved by setting the JPEG quality setting on the VNC clients to 3, which increased the frame rate to 2-3 frames per second, with noticeable image quality degradation. Now the CPU is not spending all of its time on compression, but the network is not at full capacity either. Knowing that such a performance change could be found in the VNC server, the next logical step was running it in parallel.

The VNC server being used handled each request sequentially. In this environment two main clients were accessing the server, each of the nodes driving eight monitors. Since requests are handled sequentially, the second client must always wait for the first to finish before its request is processed. Although the VNC server is no longer using all of the CPU, it is still using more than half of the time, meaning it is quite busy.


The VNC server being used was a program called xtightvnc, a modified version of the original Xvnc. There is, to date, no version of Xvnc that handles client requests in parallel. When Xvnc is launched it handles all of the X server setup required to create a second display (e.g. display :1). Another common VNC server is x11vnc, which connects to an already existing display and allows clients to view whatever is on that display; it is most commonly used by users who want remote access to display :0.

A simple test was performed in an attempt to relieve the CPU restriction of serving clients sequentially. Instead of having the client nodes connect to a single VNC server, the clients connected to individual x11vnc servers, one handling the top half of the screen and the other handling the bottom half. The performance increase was easily noticeable to the human eye. After this xtightvnc was no longer used; an X virtual frame buffer was used to create display :1, and an x11vnc server was set up for each client that wants to connect. The same 2-3 frames per second could be achieved without the quality degradation caused by lowering the compression quality.

There are many advantages to having multiple x11vnc servers rather than a single Xvnc server, such as each client not affecting the speed of another and being able to display only sections of the screen. There are also disadvantages: one worth noting is that clients are no longer synchronized in their updates and might run at different frame rates. The other cause for concern is that x11vnc can update at the same time as VirtualGL is updating the screen, causing tearing or other strange effects in the image.

VI. CONCLUSION

This project has found several factors affecting the performance of rendering on a monitor wall. A balance between network and CPU usage for the VNC server needs to be found for optimal performance; this is most easily done by having multiple VNC servers. It was also found that the read back rate can be increased noticeably by using CUDA to read pixels rather than reading through OpenGL.


    APPENDIX

    A. CUDA Memory Bandwidth Testing Program

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>

#define PINNED      0
#define CPUGPU      1
#define GPUGPU      2
#define CPUCPU      3
#define PINTOUNPIN  4
#define PINTOPIN    5
#define GPINTOUNPIN 6

int memtest(int type, long long int bytes, double *to, double *from);
void printtype(int type);

//#define SAMPLES 4
#define MEMCPY 0

int main(int argc, char *argv[])
{
    long long int bytes;
    double **toarray, **fromarray, *to, *from;
    long long int **bytesarray, *bytesr;
    int curtype;
    int i;
    double averageto, averagefrom;
    int SAMPLES;

    if (argc == 2) sscanf(argv[1], "%d", &SAMPLES);
    else SAMPLES = 3;
    cutilSafeCall(cudaSetDevice(0));

    toarray = (double **)malloc(6 * sizeof(double *));
    fromarray = (double **)malloc(6 * sizeof(double *));
    bytesarray = (long long int **)malloc(6 * sizeof(long long int *));
    for (i = 0; i < 6; i++) {
        toarray[i] = (double *)malloc(SAMPLES * sizeof(double));
        fromarray[i] = (double *)malloc(SAMPLES * sizeof(double));
        bytesarray[i] = (long long int *)malloc(SAMPLES * sizeof(long long int));
    }

    /* Collect SAMPLES timed transfers for each memory type. */
    for (curtype = 0; curtype < 6; curtype++) {
        i = 0;
        to = toarray[curtype];
        from = fromarray[curtype];
        bytesr = bytesarray[curtype];
        while (i < SAMPLES) {
            bytes = rand() % 153600000;          /* random size up to ~150 MB */
            if (bytes < 1000000) continue;
            if (!memtest(curtype, bytes, to + i, from + i)) {
                bytesr[i++] = bytes;
            }
        }
    }

    /* Convert the recorded times (ns) into average MB/s for each type. */
    printf("Type:\t\t\tTo (MB/s)\t\tFrom (MB/s)\n");
    for (curtype = 0; curtype < 6; curtype++) {
        to = toarray[curtype];


        from = fromarray[curtype];
        bytesr = bytesarray[curtype];
        averageto = 0;
        averagefrom = 0;
        for (i = 0; i < SAMPLES; i++) {
            averageto   += ((*(bytesr + i) / 1024.0 / 1024.0) / (*(to + i))   * (1000000000.0));
            averagefrom += ((*(bytesr + i) / 1024.0 / 1024.0) / (*(from + i)) * (1000000000.0));
        }
        averageto /= SAMPLES;
        averagefrom /= SAMPLES;
        printtype(curtype);
        printf("\t%7.2lf\t\t\t%7.2lf\n", averageto, averagefrom);
    }
    return 0;
}

void printtype(int type)
{
    switch (type) {
    case PINNED:
        printf("GPU to Pinned Memory: ");
        break;
    case CPUGPU:
        printf("CPU to GPU Memory: ");
        break;
    case GPUGPU:
        printf("GPU to GPU Memory: ");
        break;
    case CPUCPU:
        printf("CPU to CPU Memory: ");
        break;
    case PINTOUNPIN:
        printf("CPUPIN to CPU Memory: ");
        break;
    case PINTOPIN:
        printf("PIN to PIN Memory: ");
        break;
    case GPINTOUNPIN:
        printf("GPU to CPUPIN to CPU Memory: \n");
        break;
    }
    return;
}

/* Time a round-trip copy of 'bytes' bytes for the given memory type.
   The elapsed times (in nanoseconds) are returned through 'to' and 'from';
   a non-zero return value means the sample should be discarded. */
int memtest(int type, long long int bytes, double *to, double *from)
{
    int pinned;
    void *h, *d, *tmp;
    int i = 0;
    int j;
    double rate1, rate2;
    struct timespec start, stop;

    if (type == GPUGPU) {
        tmp = malloc(bytes);
        if (!tmp) return 1;
        d = NULL;
        cudaMalloc(&d, bytes);
        if (!d) {
            return 1;
        }
        h = NULL;
        cudaMalloc(&h, bytes);


        if (!h) {
            cudaFree(d);
            return 1;
        }
        /* Fill a host buffer with random data and copy it to the device. */
        for (i = 0; i < (bytes / sizeof(double)); i++) {
            *((double *)tmp + i) = rand();
        }
        cutilSafeCall(cudaMemcpy(h, tmp, bytes, cudaMemcpyHostToDevice));

        /* Time device-to-device copies in both directions (elapsed ns). */
        clock_gettime(CLOCK_MONOTONIC, &start);
        cutilSafeCall(cudaMemcpy(d, h, bytes, cudaMemcpyDeviceToDevice));
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate1 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        clock_gettime(CLOCK_MONOTONIC, &start);
        cutilSafeCall(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToDevice));
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate2 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        printf("%lld\t%.2lf\t%.2lf\n", bytes, rate1, rate2);
        free(tmp);
        cudaFree(d);
        cudaFree(h);
    }

    if ((type == CPUCPU) || (type == PINTOUNPIN) || (type == PINTOPIN)) {
        if ((type == PINTOUNPIN) || (type == PINTOPIN)) {
            cutilSafeCall(cudaMallocHost((void **)&h, bytes));   /* page-locked */
        } else {
            h = malloc(bytes);
        }
        if (!h) {
            return 1;
        }
        if (type == PINTOPIN) {
            cutilSafeCall(cudaMallocHost((void **)&d, bytes));
        } else {
            d = malloc(bytes);
        }
        if (!d) {
            if ((type == PINTOUNPIN) || (type == PINTOPIN)) cudaFreeHost(h);
            else free(h);
            return 1;
        }
        for (i = 0; i < (bytes / sizeof(double)); i++) {
            *((double *)h + i) = rand();
        }

        /* Time host-to-host copies in both directions (elapsed ns). */
        clock_gettime(CLOCK_MONOTONIC, &start);
#if MEMCPY
        memcpy(d, h, bytes);
#else
        cutilSafeCall(cudaMemcpy(d, h, bytes, cudaMemcpyHostToHost));
#endif
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate1 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        for (i = 0; i < (bytes / sizeof(double)); i++) {
            *((double *)d + i) = rand();
        }
        clock_gettime(CLOCK_MONOTONIC, &start);
#if MEMCPY
        memcpy(h, d, bytes);
#else
        cudaMemcpy(h, d, bytes, cudaMemcpyHostToHost);
#endif
        clock_gettime(CLOCK_MONOTONIC, &stop);


        rate2 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        printf("%lld\t%.2lf\t%.2lf\n", bytes, rate1, rate2);
        if ((type == PINTOUNPIN) || (type == PINTOPIN)) {
            cudaFreeHost(h);
        } else {
            free(h);
        }
        if (type == PINTOPIN) {
            cudaFreeHost(d);
        } else {
            free(d);
        }
    }

    if ((type == PINNED) || (type == CPUGPU)) {
        if (type == PINNED) {
            pinned = 1;
        } else {
            pinned = 0;
        }
        h = NULL;
        if (pinned == 0) {
            h = malloc(bytes);
        } else {
            cutilSafeCall(cudaMallocHost((void **)&h, bytes));   /* page-locked */
        }
        if (!h) {
            return 1;
        }
        d = NULL;
        cudaMalloc(&d, bytes);
        if (!d) {
            if (pinned == 0) free(h);
            else cudaFreeHost(h);
            return 1;
        }
        for (j = 0; j < (bytes / sizeof(double)); j++) {
            *((double *)h + j) = rand();
        }

        /* Time host-to-device and device-to-host copies (elapsed ns). */
        clock_gettime(CLOCK_MONOTONIC, &start);
        cutilSafeCall(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate1 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        clock_gettime(CLOCK_MONOTONIC, &start);
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate2 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        if ((rate1 < 0) || (rate2 < 0)) return 1;
        printf("%lld\t%.2lf\t%.2lf\n", bytes, rate1, rate2);
        if (pinned == 0) {
            free(h);
        } else {
            cutilSafeCall(cudaFreeHost(h));
        }
        cudaFree(d);
    }

    *to = rate1;
    *from = rate2;
    return 0;
}


    B. Changes to pbwin.cpp in VirtualGL

void pbwin::readpixels(GLint x, GLint y, GLint w, GLint pitch, GLint h,
    GLenum format, int ps, GLubyte *bits, GLint buf, bool stereo)
{
    static int zfq = 0;
    struct timespec start, stop;
    GLint readbuf = GL_BACK;
    _glGetIntegerv(GL_READ_BUFFER, &readbuf);

    tempctx tc(_localdpy, EXISTING_DRAWABLE, GetCurrentDrawable());

    glReadBuffer(buf);
    glPushClientAttrib(GL_CLIENT_PIXEL_STORE_BIT);

    if (pitch % 8 == 0) glPixelStorei(GL_PACK_ALIGNMENT, 8);
    else if (pitch % 4 == 0) glPixelStorei(GL_PACK_ALIGNMENT, 4);
    else if (pitch % 2 == 0) glPixelStorei(GL_PACK_ALIGNMENT, 2);
    else if (pitch % 1 == 0) glPixelStorei(GL_PACK_ALIGNMENT, 1);

    int e = glGetError();
    while (e != GL_NO_ERROR) e = glGetError(); // Clear previous error

    // On a resolution change, resize the CUDA read-back buffers.
    if ((!first) && ((cw != w) || (ch != h))) {
        if (!cudafl) {
            cudafl = 1;
            i = 1;
            rrout.PRINT("[VGL] Resolution Change Attempting CUDA Acceleration (%d,%d) to (%d,%d)\n",
                cw, ch, w, h);
        }
        cudachangesize(w, h);
        cw = w;
        ch = h;
    }
    if (first) {
        cudafl = 1;
    }
    if (cudafl) {
        if (first) {
            // First frame: initialize GLEW, CUDA, and the PBO used for read back.
            static int go = 1;
            int i = 0;
            cw = w;
            ch = h;
            glewInit();
            cudastart(w, h);
            if (go) {
                go = 0;
                cutilSafeCall(cudaSetDevice(0));
                cutilSafeCall(cudaGLSetGLDevice(0));
            }
            while (cudamakebuffer(x,y/(1


    _prof_rb.startframe();
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, buffer);
    clock_gettime(CLOCK_MONOTONIC, &start);
    _glReadPixels(x, y, w, h, format, GL_UNSIGNED_BYTE, (GLvoid *)NULL);
    clock_gettime(CLOCK_MONOTONIC, &stop);
    rrout.PRINT("[VGL] glReadPixels took %lf ns\n",
        (double)stop.tv_nsec - start.tv_nsec);
    if ((((double)stop.tv_nsec - start.tv_nsec) > 1000000) || (((double)stop.tv_nsec - start.tv_nsec)


        mlib_ImageLookUp_Inp(image, (const void **)luts);
        mlib_ImageDelete(image);
    }
    else
    {
    #endif
    if (first) {
        first = false;
        if (fconfig.verbose) rrout.println("[VGL] Using software gamma correction (correction factor=%f)\n",
            (double)fconfig.gamma);
    }
    unsigned short *ptr1, *ptr2 = (unsigned short *)(&bits[pitch*h]);
    for (ptr1 = (unsigned short *)bits; ptr1


        snprintf(_autotestframe, 79, "__VGL_AUTOTESTFRAME%x=%d", (unsigned int)_win,
            _autotestframecount);
        putenv(_autotestframe);
    }

    glPopClientAttrib();
    tc.restore();
    glReadBuffer(readbuf);
}