Increasing Performance of a High-Resolution Monitor Wall
Jason Monk
Advisor: Prof. Bruce Segee
Department of Electrical and Computer Engineering
University of Maine, Orono, USA
Abstract
This paper examines several aspects of rendering OpenGL graphics on large displays using VirtualGL and VNC. It looks at increasing performance in two areas of rendering on a monitor wall: a CUDA-enhanced version of VirtualGL and the advantages of running multiple VNC servers. It discusses the restrictions imposed by read-back and blitting rates and how those rates are affected by the size of the image being rendered. A CUDA extension for VirtualGL was developed that allows faster read back at high resolutions.
I. INTRODUCTION
As we move into the digital age, large displays are becoming more and more popular. The most common way to achieve a large display is to use a projector. A projector is a useful tool; however, as the image gets larger each pixel gets bigger, and the image looks progressively worse. A less common alternative to a projector is a monitor wall: several monitors set up in a grid. This solution allows large images while maintaining a high level of detail, because adding more monitors does not increase the pixel size the way enlarging a projected image does.

There are several ways to achieve high-resolution data visualization, at both the hardware and the software layer. Although most high-resolution systems use this kind of monitor grid, there are several ways to distribute the image to each display. Some systems create the image on one computer and distribute it to client machines afterwards, whereas others have each machine create its own piece of the image.
The four-by-four monitor wall located at the UMaine Innovation Center uses a system where one computer creates the image and the image is then distributed. To do this, the computers run two software packages: xtightvncserver, to create a large display and distribute it, and VirtualGL, to do the rendering for applications run inside the VNC server. Currently the setup achieves frame rates of one to two frames per second while rendering close to full screen (about twenty megapixels).
When nVidia started producing the G8X series of cards, they introduced an architecture called CUDA, and most of their video cards since then have had CUDA support. With this new architecture they provided extensions for C/C++ that form an Application Programming Interface (API) allowing code to be executed on the GPU. Since then the concept of GPGPU (general-purpose computing on graphics processing units) has been growing: the GPU is very good at algebra and at running things in parallel, so that power should be put to use for other applications.
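As a simple illustration of the GPGPU idea (this example is purely illustrative and is not part of the wall software), a CUDA kernel expresses an element-wise vector operation that the GPU executes across many threads in parallel:

    #include <cuda_runtime.h>

    /* Each GPU thread handles one element of the vectors. */
    __global__ void vector_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    /* Launch enough blocks of 256 threads to cover all n elements. */
    void add_on_gpu(const float *d_a, const float *d_b, float *d_c, int n)
    {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
        cudaDeviceSynchronize();
    }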
Fig. 1. Display Wall at Innovation Center
II. PROJECT GOAL
The goal of this project is to increase the performance of the display wall at the Innovation Center, preferably by using CUDA to harness unused processing power available in the GPUs of the computers hosting the wall.
III. READ BACK RATE TESTING
At the start of the project it was believed that the factor limiting the frame rate of the display was read back from the video card (the process of getting each frame from the video card to the CPU for distribution). No reliable read-back rates for these cards are available online, so the rates needed to be tested.

A program was written (Appendix A) that uses CUDA to write and read large, randomly sized blocks of data to and from the video card's memory and then processes the times collected; the program was later expanded to test the several types of memory accessible through the function cudaMemcpy. The data collected showed that write speeds were usually between 1700 and 2000 MB/s and read speeds were only slightly slower, between 1500 and 1800 MB/s; the results are shown in Figure 2. When writing to the card, the cudaMemcpy function
appeared to have an overhead of about 1.2 ms, whereas when reading from the card the overhead was only 37 µs. Using page-locked memory can significantly increase both read and write speeds, to over 3 GB/s.
Fig. 2. Read and Write Test on GTX260
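As a point of reference, the core of the Appendix A measurement reduces to timing a single cudaMemcpy in each direction. The sketch below is illustrative rather than the appendix code itself: the 100 MB block size, the helper name seconds, and the use of page-locked memory via cudaMallocHost are choices made for the example.

    #include <stdio.h>
    #include <time.h>
    #include <cuda_runtime.h>

    /* Difference between two timespecs, in seconds. */
    static double seconds(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        const size_t bytes = 100u * 1024 * 1024;   /* 100 MB test block */
        void *host, *dev;
        struct timespec t0, t1;

        cudaMallocHost(&host, bytes);              /* page-locked host buffer */
        cudaMalloc(&dev, bytes);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   /* write to card */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("write: %.0f MB/s\n", (bytes / 1048576.0) / seconds(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);   /* read back */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("read:  %.0f MB/s\n", (bytes / 1048576.0) / seconds(t0, t1));

        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }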
What had not been taken into account was that on the wall, rendering may also be taking place while read back is occurring. A new set of tests was therefore run while something requiring rendering was running on the video card; glxgears was used for this test because it was already set up on the testing machine and was easy to run automated tests with.
Fig. 3. Read Speeds while Rendering
Read and write speeds were tested while rendering square glxgears windows with widths from 50 up to 1150 pixels. The results showed that the read and write speeds while rendering decreased linearly with the number of pixels being rendered. Figure 3 is a graph of the read speeds while rendering various square resolutions; the plot clearly follows a linear trend. Notice the very large drop between the CUDA testing program running by itself and any rendering at all taking place: bandwidth falls by several hundred MB/s as soon as GLX rendering is occurring.
Using this linear regression we can extrapolate that there would be no bandwidth left while rendering at sizes upward of 12 MP, let alone 20 MP. This shows that read back during rendering is undesirable and that better frame rates can only be achieved if rendering and reading are performed sequentially, one after the other.
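Written out, the extrapolation simply treats the measured read bandwidth as a linear function of the number of rendered pixels (the intercept and slope would be taken from the fit in Figure 3; only the form of the model is shown here):

    B(p) \approx B_0 - k\,p, \qquad B(p_{\max}) = 0 \;\Rightarrow\; p_{\max} = B_0 / k

where B_0 is the bandwidth measured with no rendering, k is the per-pixel bandwidth loss, and p_max is the rendered size at which no read-back bandwidth remains (about 12 MP for the fitted values).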
The question became what exactly VirtualGL was doing, so the next step was to examine the VirtualGL source code.
IV. VIRTUALGL
VirtualGL is software that intercepts the OpenGL commands from any program and runs them on the video card in an off-screen buffer called a PBuffer. Once each frame is rendered, VirtualGL reads the frame from the PBuffer and brings it back to host memory (memory available to the CPU), where it becomes available to other software such as a VNC server. VirtualGL is written mostly in C++, with a few pieces in C. Already built into VirtualGL is a profiler that tracks the frame rate of VGL; the profiler times the read back, the blit (writing to the X server), and the total frame rate.
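For context, the conventional path that VirtualGL implements amounts to rendering into a PBuffer and pulling the pixels straight into host memory with glReadPixels. The sketch below is a simplified illustration of that path, not VirtualGL's actual code; the attribute lists, RGB pixel format, and lack of error handling are simplifications.

    #include <GL/glx.h>
    #include <GL/gl.h>
    #include <stdlib.h>

    /* Render off screen and read the finished frame back to host memory. */
    unsigned char *render_and_readback(Display *dpy, int w, int h)
    {
        int fbattribs[] = { GLX_DRAWABLE_TYPE, GLX_PBUFFER_BIT,
                            GLX_RENDER_TYPE,  GLX_RGBA_BIT, None };
        int pbattribs[] = { GLX_PBUFFER_WIDTH, w, GLX_PBUFFER_HEIGHT, h, None };
        int n;

        GLXFBConfig *cfg = glXChooseFBConfig(dpy, DefaultScreen(dpy), fbattribs, &n);
        GLXPbuffer pb = glXCreatePbuffer(dpy, cfg[0], pbattribs);
        GLXContext ctx = glXCreateNewContext(dpy, cfg[0], GLX_RGBA_TYPE, NULL, True);
        glXMakeContextCurrent(dpy, pb, pb, ctx);

        /* ... the application's OpenGL drawing for one frame goes here ... */

        /* The frame is now in the PBuffer; copy it into CPU-visible memory
         * so other software such as a VNC server can use it. */
        unsigned char *frame = (unsigned char *)malloc((size_t)w * h * 3);
        glReadPixels(0, 0, w, h, GL_RGB, GL_UNSIGNED_BYTE, frame);
        return frame;
    }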
VirtualGL and a VNC server were compiled and installed on a separate machine so that VirtualGL speeds could be tested and any code changes could be run in a test environment instead of on the working system. Profiling on the testing machine (at low resolutions) confirmed that the issue was with read back, so the read-back speed needed to be increased. To read back each frame, VirtualGL calls an OpenGL function named glReadPixels. It is widely acknowledged that glReadPixels does not perform nearly as fast as it should, and performance testing showed that glReadPixels was not transferring at the full speed found earlier with CUDA. This means that if the transfer could be performed using CUDA rather than the GL call, it could run faster.
nVidia has been working on a set of CUDA functions that allow interoperability between CUDA and OpenGL. Currently they allow an OpenGL Pixel Buffer Object (PBO) to be mapped to a pointer usable in CUDA calls. The problem with using CUDA to read the frame is that the frame first has to be moved from the PBuffer it was rendered into to a PBO. The only way to move the pixels from the PBuffer to a PBO is the OpenGL function glReadPixels. This glReadPixels call need not be as slow as the original, since the transfer stays within GPU memory, which has high bandwidth (greater than 10 GB/s). There was a similar example in the nVidia CUDA SDK: it rendered, called glReadPixels to move the pixels into a PBO, and then mapped the PBO to a pointer readable by CUDA calls. In the nVidia example glReadPixels copied at a speed of 10-40 GB/s, as opposed to copying to host memory, which runs at less than 1 GB/s.
nVidia's example from the CUDA SDK requested a different pixel format than the other programs used for testing VirtualGL: it requested not only RGB pixels but alpha as well. The function controlling these pixel formats is glXChooseVisual, which takes a list of attributes for the window and produces an XVisualInfo structure to be used. Since VirtualGL already intercepts this call, only a simple change is required to replace any requested attributes with ones chosen to increase performance. The performance increase arises because glReadPixels takes a parameter specifying what format to read the pixels into; when the requested format and the format of the buffer do not match, a conversion must be done, creating a lot of overhead.
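A minimal illustration of the idea (this is not the actual VirtualGL patch; the interposer name and the exact attribute list are assumptions made for the example) is to ignore the attributes the application asked for and always request an RGBA visual with an alpha channel:

    #include <GL/glx.h>

    /* Interposed stand-in for glXChooseVisual: whatever the application
     * requests, ask the real function for an RGBA visual with alpha so the
     * buffer format matches the format later passed to glReadPixels. */
    XVisualInfo *forced_glXChooseVisual(Display *dpy, int screen, int *attrib_list)
    {
        static int forced[] = {
            GLX_RGBA,
            GLX_RED_SIZE,   8,
            GLX_GREEN_SIZE, 8,
            GLX_BLUE_SIZE,  8,
            GLX_ALPHA_SIZE, 8,
            GLX_DOUBLEBUFFER,
            None
        };
        (void)attrib_list;          /* the application's attributes are ignored */
        return glXChooseVisual(dpy, screen, forced);
    }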
This process of mapping and unmapping a PBO each read cycle creates a lot of overhead. When dealing with a small window of only a few megapixels, the overhead is so great that no read-back increase can be found. However, when larger resolutions are reached, specifically 20 MP, an increase in read-back rate can be found. When testing Google Earth at a full-screen size of 20 MP, an increase in read back of about 30% was found, from the original 120 MP/s in the testing environment to a speed around 160 MP/s. To achieve this speedup, several things must be done to set up the process whenever a PBuffer is created:
1) Before any CUDA GL calls can be made, cudaGLSetGLDevice must be called; it is also good practice to call cudaSetDevice to make explicit which device is being used.
2) The PBO must be created and filled with data; this can be done using glGenBuffers, glBindBuffer, and glBufferData.
3) The PBO must be registered with CUDA using cudaGLRegisterBufferObject; upon exiting, cudaGLUnregisterBufferObject should be called.
Each time a frame is read, several steps are also required (a sketch covering both sequences follows this list):
1) The PBO must be bound to the pack buffer and filled using glReadPixels.
2) The PBO must be unbound from the pack buffer and mapped using cudaGLMapBufferObject.
3) The data must be read from the PBO using cudaMemcpy.
4) The PBO must be unmapped using cudaGLUnmapBufferObject.
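The sketch below walks through both sequences using the CUDA/OpenGL interoperability API named above; the buffer handle, RGBA format, and helper function names are illustrative, and the real VirtualGL changes are in Appendix B.

    #include <GL/glew.h>
    #include <cuda_runtime.h>
    #include <cuda_gl_interop.h>

    static GLuint pbo;

    /* One-time setup whenever a w x h PBuffer is created. */
    void cuda_readback_setup(int w, int h)
    {
        cudaSetDevice(0);
        cudaGLSetGLDevice(0);                               /* setup step 1 */

        glGenBuffers(1, &pbo);                              /* setup step 2 */
        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo);
        glBufferData(GL_PIXEL_PACK_BUFFER_ARB, (size_t)w * h * 4, NULL, GL_STREAM_READ);
        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

        cudaGLRegisterBufferObject(pbo);                    /* setup step 3 */
    }

    /* Per-frame read back of a w x h RGBA frame into host memory. */
    void cuda_readback_frame(int w, int h, void *host_frame)
    {
        void *dev_ptr = NULL;

        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo);        /* frame step 1 */
        glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);          /* frame step 2 */

        cudaGLMapBufferObject(&dev_ptr, pbo);
        cudaMemcpy(host_frame, dev_ptr, (size_t)w * h * 4,  /* frame step 3 */
                   cudaMemcpyDeviceToHost);
        cudaGLUnmapBufferObject(pbo);                       /* frame step 4 */
    }

    /* On exit the buffer is unregistered and deleted again. */
    void cuda_readback_cleanup(void)
    {
        cudaGLUnregisterBufferObject(pbo);
        glDeleteBuffers(1, &pbo);
    }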
The entire process from the user's program to the X server is shown in Figure 4. Although this process is more complex, it is faster because CUDA is better at utilizing the bandwidth between the CPU and the GPU. The corresponding VirtualGL code can be found in Appendix B.

Once the frame was already being transferred through CUDA, an attempt was made to transfer and blit only the regions that had changed. A simple algorithm was used: if a change was found within a group of 512 pixels, those 512 pixels would be updated. The overhead associated with checking for changes and transferring small groups of pixels is too large to make this approach worth using, but it shows promise for solving the problem of blitting large frames to the X server.
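A sketch of the idea follows; the 512-pixel block size comes from the text, while the frame layout, the previous-frame buffer, and the update_block callback are assumptions made for the example.

    #include <string.h>
    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK_PIXELS    512
    #define BYTES_PER_PIXEL 4

    /* Compare the new frame against the previous one in 512-pixel blocks and
     * hand only the changed blocks to update_block() for transfer/blitting. */
    void update_changed_blocks(const uint8_t *new_frame, uint8_t *prev_frame,
                               size_t total_pixels,
                               void (*update_block)(size_t offset, size_t bytes))
    {
        const size_t block_bytes = BLOCK_PIXELS * BYTES_PER_PIXEL;
        const size_t total_bytes = total_pixels * BYTES_PER_PIXEL;

        for (size_t off = 0; off < total_bytes; off += block_bytes) {
            size_t len = (off + block_bytes <= total_bytes) ? block_bytes
                                                            : total_bytes - off;
            if (memcmp(new_frame + off, prev_frame + off, len) != 0) {
                memcpy(prev_frame + off, new_frame + off, len);
                update_block(off, len);    /* only the pixels that changed move on */
            }
        }
    }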
Fig. 4. Flow for CUDA Enhanced Read back
V. VNC
When testing the performance of code changes in VirtualGL, a significant difference was found between the display wall at the Innovation Center and the test environment: the test environment was limited by the time it takes to read back, whereas the display wall, at much higher resolution, is limited by the blit speed. In the blitting process VirtualGL makes a series of X calls to draw to the window and then calls XSync, which is what takes most of the time, so the current speed did not appear to be due to VirtualGL itself being too slow.
At that time the VNC server was also shown to be taking 100% of one of the CPUs; since the VNC server is not multi-threaded, it could not take any more time than it already was. According to profiling of the VNC server done in the past, compression was taking the largest amount of time. Compression had been enabled because, with no compression, the client machines were often crashing. Attempting raw mode again was the next step to reduce CPU usage. By changing the read and write network buffers, raw mode was able to run for several minutes. During those minutes it performed at about the same frame rate as before; however, the VNC server was using much less CPU and the network usage had gone up significantly (the network was probably saturated during image transfer).
Since these two transfer types had the same performance, there was likely a balance point between compression and network usage that would give better performance. After a few tests, better performance was achieved by setting the JPEG quality setting on the VNC clients to 3, which increased the frame rate to 2-3 frames per second with noticeable image-quality degradation. At that point the CPU is no longer spending all of its time on compression, but the network is not at full capacity either. Knowing that such a performance change can be found in the VNC server, the next logical step was to run it in parallel.
The VNC server being used handles each request sequentially. In this environment two main clients were accessing the server, each node driving eight monitors. Since requests are handled sequentially, the second client must always wait for the first to finish before its request is processed. Although the VNC server is no longer using all of the CPU, it is still using more than half of its time, meaning it is quite busy.
The VNC server being used was a program called xtightvnc, a modified version of the original Xvnc. There is, to date, no version of Xvnc that handles client requests in parallel. When Xvnc is launched it handles all of the X server setup required to create a second display (e.g. display :1). There is another common VNC server called x11vnc, which connects to an already existing display and allows clients to view whatever is on that display; it is most commonly used by users who want remote access to display :0.
A simple test was performed in an attempt to relieve the CPU restrictions caused by serving clients sequentially. Instead of having the client nodes connect to a single VNC server, the clients connected to individual x11vnc servers, one handling the top half of the screen and the other the bottom half. The performance increase was easily noticeable to the eye. After this, xtightvnc was no longer used; an X virtual frame buffer (Xvfb) was used to create display :1, and an x11vnc server was set up for each client that wants to connect. The same 2-3 frames per second could be achieved without the quality degradation that came from changing the compression settings.

There are many advantages to having multiple x11vnc servers rather than a single Xvnc server, such as each client not affecting the speed of another and being able to serve only sections of the screen. There are also disadvantages. One worth noting is that clients are no longer synchronized in their updates and might run at different frame rates. The other cause for concern is that x11vnc can read the screen at the same time as VirtualGL is updating it, causing tearing or other strange effects in the image.
VI. CONCLUSION
This project has identified several factors affecting the performance of rendering on a monitor wall. A balance between network and CPU usage for the VNC server needs to be found for optimal performance; this is easiest to achieve by having multiple VNC servers. It was also found that the read-back rate can be increased noticeably by using CUDA to read pixels rather than reading through OpenGL.
APPENDIX
A. CUDA Memory Bandwidth Testing Program
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>

#define PINNED 0
#define CPUGPU 1
#define GPUGPU 2
#define CPUCPU 3
#define PINTOUNPIN 4
#define PINTOPIN 5
#define GPINTOUNPIN 6

int memtest(int type, long long int bytes, double *to, double *from);
void printtype(int type);

//#define SAMPLES 4
#define MEMCPY 0

int main(int argc, char *argv[])
{
    long long int bytes;
    double **toarray, **fromarray, *to, *from;
    long long int **bytesarray, *bytesr;
    int curtype;
    int i;
    double averageto, averagefrom;
    int SAMPLES;

    if (argc == 2) sscanf(argv[1], "%d", &SAMPLES);
    else SAMPLES = 3;
    cutilSafeCall(cudaSetDevice(0));

    toarray = (double **)malloc(6 * sizeof(double *));
    fromarray = (double **)malloc(6 * sizeof(double *));
    bytesarray = (long long int **)malloc(6 * sizeof(long long int *));
    for (i = 0; i < 6; i++) {
        toarray[i] = (double *)malloc(SAMPLES * sizeof(double));
        fromarray[i] = (double *)malloc(SAMPLES * sizeof(double));
        bytesarray[i] = (long long int *)malloc(SAMPLES * sizeof(long long int));
    }

    /* Collect SAMPLES timed transfers of random size for each memory type. */
    for (curtype = 0; curtype < 6; curtype++) {
        i = 0;
        to = toarray[curtype];
        from = fromarray[curtype];
        bytesr = bytesarray[curtype];
        while (i < SAMPLES) {
            bytes = rand() % 153600000;
            if (bytes < 1000000) continue;
            if (!memtest(curtype, bytes, to + i, from + i)) {
                bytesr[i++] = bytes;
            }
        }
    }

    /* Convert the collected times (ns) to MB/s and print the averages. */
    printf("Type:\t\t\tTo (MB/s)\t\tFrom (MB/s)\n");
    for (curtype = 0; curtype < 6; curtype++) {
        to = toarray[curtype];
        from = fromarray[curtype];
        bytesr = bytesarray[curtype];
        averageto = 0;
        averagefrom = 0;
        for (i = 0; i < SAMPLES; i++) {
            averageto += ((*(bytesr + i) / 1024.0 / 1024.0) / (*(to + i)) * (1000000000.0));
            averagefrom += ((*(bytesr + i) / 1024.0 / 1024.0) / (*(from + i)) * (1000000000.0));
        }
        averageto /= SAMPLES;
        averagefrom /= SAMPLES;
        printtype(curtype);
        printf("\t%7.2lf\t\t\t%7.2lf\n", averageto, averagefrom);
    }
    return 0;
}

void printtype(int type)
{
    switch (type) {
    case PINNED:
        printf("GPU to Pinned Memory: ");
        break;
    case CPUGPU:
        printf("CPU to GPU Memory: ");
        break;
    case GPUGPU:
        printf("GPU to GPU Memory: ");
        break;
    case CPUCPU:
        printf("CPU to CPU Memory: ");
        break;
    case PINTOUNPIN:
        printf("CPUPIN to CPU Memory: ");
        break;
    case PINTOPIN:
        printf("PIN to PIN Memory: ");
        break;
    case GPINTOUNPIN:
        printf("GPU to CPUPIN to CPU Memory: \n");
        break;
    }
    return;
}

/* Time one transfer of 'bytes' bytes in each direction for the given memory
 * type; the elapsed times in nanoseconds are returned through 'to' and
 * 'from'.  Returns non-zero if the test could not be run. */
int memtest(int type, long long int bytes, double *to, double *from)
{
    int pinned;
    void *h, *d, *tmp;
    int i = 0;
    int j;
    double rate1, rate2;
    struct timespec start, stop;

    if (type == GPUGPU) {
        tmp = malloc(bytes);
        if (!tmp) return 1;
        d = NULL;
        cudaMalloc(&d, bytes);
        if (!d) {
            return 1;
        }
        h = NULL;
        cudaMalloc(&h, bytes);
        if (!h) {
            cudaFree(d);
            return 1;
        }
        for (i = 0; i < (bytes / sizeof(double)); i++) {
            *((double *)tmp + i) = rand();
        }
        cutilSafeCall(cudaMemcpy(h, tmp, bytes, cudaMemcpyHostToDevice));

        clock_gettime(CLOCK_MONOTONIC, &start);
        cutilSafeCall(cudaMemcpy(d, h, bytes, cudaMemcpyDeviceToDevice));
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate1 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        clock_gettime(CLOCK_MONOTONIC, &start);
        cutilSafeCall(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToDevice));
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate2 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        printf("%lld\t%.2lf\t%.2lf\n", bytes, rate1, rate2);
        free(tmp);
        cudaFree(d);
        cudaFree(h);
    }

    if ((type == CPUCPU) || (type == PINTOUNPIN) || (type == PINTOPIN)) {
        if ((type == PINTOUNPIN) || (type == PINTOPIN)) {
            cutilSafeCall(cudaMallocHost((void **)&h, bytes));
        } else {
            h = malloc(bytes);
        }
        if (!h) {
            return 1;
        }
        if (type == PINTOPIN) {
            cutilSafeCall(cudaMallocHost((void **)&d, bytes));
        } else {
            d = malloc(bytes);
        }
        if (!d) {
            if ((type == PINTOUNPIN) || (type == PINTOPIN)) cudaFreeHost(h);
            else free(h);
            return 1;
        }
        for (i = 0; i < (bytes / sizeof(double)); i++) {
            *((double *)h + i) = rand();
        }
        clock_gettime(CLOCK_MONOTONIC, &start);
#if MEMCPY
        memcpy(d, h, bytes);
#else
        cutilSafeCall(cudaMemcpy(d, h, bytes, cudaMemcpyHostToHost));
#endif
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate1 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        for (i = 0; i < (bytes / sizeof(double)); i++) {
            *((double *)d + i) = rand();
        }
        clock_gettime(CLOCK_MONOTONIC, &start);
#if MEMCPY
        memcpy(h, d, bytes);
#else
        cudaMemcpy(h, d, bytes, cudaMemcpyHostToHost);
#endif
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate2 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        printf("%lld\t%.2lf\t%.2lf\n", bytes, rate1, rate2);
        if ((type == PINTOUNPIN) || (type == PINTOPIN)) {
            cudaFreeHost(h);
        } else {
            free(h);
        }
        if (type == PINTOPIN) {
            cudaFreeHost(d);
        } else {
            free(d);
        }
    }

    if ((type == PINNED) || (type == CPUGPU)) {
        if (type == PINNED) {
            pinned = 1;
        } else {
            pinned = 0;
        }
        h = NULL;
        if (pinned == 0) {
            h = malloc(bytes);
        } else {
            cutilSafeCall(cudaMallocHost((void **)&h, bytes));
        }
        if (!h) {
            return 1;
        }
        d = NULL;
        cudaMalloc(&d, bytes);
        if (!d) {
            free(h);
            return 1;
        }
        for (j = 0; j < (bytes / sizeof(double)); j++) {
            *((double *)h + j) = rand();
        }
        clock_gettime(CLOCK_MONOTONIC, &start);
        cutilSafeCall(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate1 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        clock_gettime(CLOCK_MONOTONIC, &start);
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate2 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        if ((rate1 < 0) || (rate2 < 0)) return 1;
        printf("%lld\t%.2lf\t%.2lf\n", bytes, rate1, rate2);
        if (pinned == 0) {
            free(h);
        } else {
            cutilSafeCall(cudaFreeHost(h));
        }
        cudaFree(d);
    }

    *to = rate1;
    *from = rate2;
    return 0;
}
7/30/2019 Final Monk
9/12
B. Changes to pbwin.cpp in VirtualGL
void pbwin::readpixels(GLint x, GLint y, GLint w, GLint pitch, GLint h,
    GLenum format, int ps, GLubyte *bits, GLint buf, bool stereo)
{
    static int zfq = 0;
    struct timespec start, stop;
    GLint readbuf = GL_BACK;
    _glGetIntegerv(GL_READ_BUFFER, &readbuf);
    tempctx tc(_localdpy, EXISTING_DRAWABLE, GetCurrentDrawable());
    glReadBuffer(buf);
    glPushClientAttrib(GL_CLIENT_PIXEL_STORE_BIT);
    if(pitch%8==0) glPixelStorei(GL_PACK_ALIGNMENT, 8);
    else if(pitch%4==0) glPixelStorei(GL_PACK_ALIGNMENT, 4);
    else if(pitch%2==0) glPixelStorei(GL_PACK_ALIGNMENT, 2);
    else if(pitch%1==0) glPixelStorei(GL_PACK_ALIGNMENT, 1);
    int e=glGetError();
    while(e!=GL_NO_ERROR) e=glGetError(); // Clear previous error
    // Handle a resolution change on an existing window.
    if ((!first) && ((cw!=w) || (ch!=h))) {
        if (!cudafl) {
            cudafl = 1;
            i = 1;
            rrout.PRINT("[VGL] Resolution Change Attempting CUDA Acceleration (%d,%d) to (%d,%d)\n",
                cw, ch, w, h);
        }
        cudachangesize(w, h);
        cw = w;
        ch = h;
    }
    if (first) {
        cudafl = 1;
    }
    if (cudafl) {
        if (first) {
            static int go = 1;
            int i = 0;
            cw = w;
            ch = h;
            glewInit();
            cudastart(w, h);
            // Select the CUDA/GL device only once per process.
            if (go) {
                go = 0;
                cutilSafeCall(cudaSetDevice(0));
                cutilSafeCall(cudaGLSetGLDevice(0));
            }
            while (cudamakebuffer(x,y/(1
        _prof_rb.startframe();
        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, buffer);
        clock_gettime(CLOCK_MONOTONIC, &start);
        _glReadPixels(x, y, w, h, format, GL_UNSIGNED_BYTE, (GLvoid *)NULL);
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rrout.PRINT("[VGL] glReadPixels took %lf ns\n",
            (double)stop.tv_nsec - start.tv_nsec);
        if ((((double)stop.tv_nsec - start.tv_nsec) > 1000000)||(((double)stop.tv_nsec - start.tv_nsec)
    mlib_ImageLookUp_Inp(image, (const void **)luts);
    mlib_ImageDelete(image);
    }
    else
    {
#endif
    if(first) {
        first = false;
        if(fconfig.verbose) rrout.println("[VGL] Using software gamma correction (correction factor=%f)\n",
            (double)fconfig.gamma);
    }
    unsigned short *ptr1, *ptr2=(unsigned short *)(&bits[pitch*h]);
    for(ptr1=(unsigned short *)bits; ptr1
snprintf(_autotestframe, 79, "__VGL_AUTOTESTFRAME%x=%d", (unsigned int)_win,
_autotestframecount);
putenv(_autotestframe);
}
glPopClientAttrib();
tc.restore();
glReadBuffer(readbuf);
}