
    Increasing Performance of a High-Resolution Monitor Wall

    Jason Monk

Advisor: Prof. Bruce Segee

    Department of Electrical and Computer Engineering

University of Maine, Orono, USA

    [email protected]

    Abstract

This paper examines several aspects of rendering OpenGL graphics on large displays using VirtualGL and VNC. It looks at increasing performance in a couple of aspects of rendering on a monitor wall, in particular a CUDA-enhanced version of VirtualGL as well as the advantages of having multiple VNC servers. It discusses restrictions caused by read back and blitting rates and how they are affected by the size of the image being rendered. A CUDA extension for VirtualGL was developed allowing for faster read back at high resolutions.

I. INTRODUCTION

As we move into the digital age, large displays are becoming more and more popular. The most common way to achieve this is with a projector. A projector is a useful tool; however, as the image gets larger each pixel gets bigger, and the image looks worse and worse. A less common alternative to a projector is a monitor wall: several monitors set up in a grid. This solution allows large images while maintaining a high level of detail, because adding more monitors does not increase the pixel size of the existing monitors the way enlarging a projected image does.

There are several ways to achieve high-resolution data visualization, at both the hardware and the software layer. Although most high-resolution systems involve this monitor grid, there are a few ways to distribute the image to each display. Some create the image on one computer and distribute it to client machines afterwards, whereas others have each machine create its own piece of the image.

The four-by-four monitor wall located at the UMaine Innovation Center uses a system where one computer creates the image and then distributes it. To do this the computers run two software packages: xtightvncserver, to create a large display and distribute it, and VirtualGL, to do the rendering for applications run in the VNC server. Currently the setup can achieve frame rates of one to two frames per second while rendering close to full screen (about twenty megapixels).

When nVidia started producing the G8X series of cards they introduced an architecture called CUDA, and most of their video cards since then have had CUDA support. With this new architecture they provided extensions for C/C++ that create an Application Programming Interface (API) allowing code to be executed on the GPU. Since then the concept of GPGPU (general-purpose computing on graphics processing units) has been growing: the GPU is very good at algebra and at running things in parallel, so that power should be put to use for other applications.

    Fig. 1. Display Wall at Innovation Center

II. PROJECT GOAL

The goal of this project is to increase the performance of the display wall at the Innovation Center, preferably by harnessing, through CUDA, unused processing power available in the GPUs of the computers hosting the wall.

III. READ BACK RATE TESTING

At the start of the project it was believed that the factor limiting the frame rate of the display was read back from the video card (the process of getting each frame from the video card to the CPU for distribution). There were no reliable read-back rates available for these cards, so the rates needed to be tested.

A program was written (Appendix A) that uses CUDA to write and read large, random amounts of data to and from the video card memory and then processes the times collected; the program was later expanded to test the several types of memory accessible through the function cudaMemcpy. The data collected showed that write speeds were usually between 1700 and 2000 MB/s and read speeds were only slightly slower, between 1500 and 1800 MB/s; the results are shown in figure 2.


When writing to the card, the cudaMemcpy function appeared to have an overhead of about 1.2 ms, whereas when reading from the card the overhead was only 37 µs. Using page-locked memory can significantly increase both read and write speeds, to over 3 GB/s.
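The core of each measurement is a timed cudaMemcpy between a host buffer and device memory. The sketch below shows only the page-locked, host-to-device case; the 100 MB buffer size and device index are arbitrary choices for illustration, and the full program covering all of the memory types appears in Appendix A.

    #include <stdio.h>
    #include <time.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        const size_t bytes = 100 * 1024 * 1024;      /* 100 MB test buffer */
        void *host, *dev;
        struct timespec start, stop;
        double ns;

        cudaSetDevice(0);
        cudaMallocHost(&host, bytes);                /* page-locked host memory */
        cudaMalloc(&dev, bytes);

        clock_gettime(CLOCK_MONOTONIC, &start);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        clock_gettime(CLOCK_MONOTONIC, &stop);

        ns = (stop.tv_sec - start.tv_sec) * 1e9 + (stop.tv_nsec - start.tv_nsec);
        printf("Write speed: %.1f MB/s\n", (bytes / 1024.0 / 1024.0) / (ns / 1e9));

        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }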

    Fig. 2. Read and Write Test on GTX260

What had not been taken into account was that while read back was occurring on the wall, rendering may or may not also be taking place. So a new set of tests was run while something requiring rendering was also running on the video card; glxgears was used for this because it was already set up on the testing machine and was easy to use for automated testing.

    Fig. 3. Read Speeds while Rendering

Read and write speeds were tested while rendering square glxgears windows with widths from 50 up to 1150 pixels. The results showed that the available read and write bandwidth decreased linearly with the number of pixels being rendered. Figure 3 is a graph of the read speeds while rendering various square resolutions; the plot clearly follows a linear trend. Notice that there is a very large drop from the CUDA testing program being run by itself to any rendering happening at all: there is a drop of several hundred MB/s in bandwidth as soon as GLX rendering is occurring.

Using a linear regression we can extrapolate that there would be no bandwidth left while rendering at sizes upwards of 12 MP, let alone 20 MP. This shows that read back during rendering is undesirable and that better frame rates can only be achieved if rendering and reading are performed sequentially, one after another. The question became what exactly VirtualGL was doing, so the next step was examining the VirtualGL source code.

    IV. VIRTUALGL

VirtualGL is software that intercepts OpenGL commands from any program and runs them on the video card in an off-screen buffer called a PBuffer. Once each frame is rendered, VirtualGL reads the frame from the PBuffer back into host memory (memory available to the CPU), where it is available to other software such as a VNC server. VirtualGL is written mostly in C++ with a few pieces in C. Already written into VirtualGL is a profiler that tracks the frame rate of VGL; the profiler times the read back, the blit (writing to the X server), and the total frame rate.

VirtualGL and a VNC server were compiled and installed on a separate machine so that VirtualGL speeds could be tested and code changes could be run in a test environment instead of on the working system. Running the profiler on the testing machine (at low resolutions) confirmed that the issue was with read back, so the read back speed needed to be increased. VirtualGL reads frames with an OpenGL function named glReadPixels, and it is widely acknowledged that glReadPixels does not perform nearly as fast as it should. Performance testing showed that glReadPixels was not transferring at the full speed found earlier with CUDA, which means that if the transfer could be performed using CUDA rather than the GL call it could run faster.

nVidia has been working on a set of CUDA functions that allow interoperability between CUDA and OpenGL. Currently they allow an OpenGL Pixel Buffer Object (PBO) to be mapped to a pointer available to CUDA calls. The problem with using CUDA to read the frame is that the frame first has to be moved from the PBuffer it was rendered into to a PBO, and the only way to move the pixels from the PBuffer to a PBO is the OpenGL function glReadPixels. This glReadPixels call need not be as slow as the original, since the transfer stays within GPU memory, which has high bandwidth (greater than 10 GB/s). There is a similar example in the nVidia CUDA SDK: it renders, calls glReadPixels to move the pixels into a PBO, and then maps the PBO to a pointer readable by CUDA calls. In the nVidia example glReadPixels copies at a speed of 10-40 GB/s, as opposed to copying to host memory, which runs at less than 1 GB/s.


The example nVidia provides in the CUDA SDK requests a different pixel format than the other programs used for testing VirtualGL: it requests not only RGB pixels but an alpha channel as well. The function controlling these pixel formats is glXChooseVisual, which takes a list of attributes for the window and produces an XVisual structure to be used. Since VirtualGL already intercepts this call, only a simple change is required to replace any requested attributes with ones that increase performance. The performance increase occurs because glReadPixels takes a parameter specifying what format to read the pixels into; when the format requested and the format of the buffer do not match, a conversion must be done, creating a lot of overhead.
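As an illustration of the kind of attribute list involved (not the exact one used inside VirtualGL), a visual can be requested with an alpha channel so that the buffer format matches the GL_RGBA format later passed to glReadPixels:

    #include <GL/glx.h>

    /* Request a 32-bit RGBA visual so that the buffer format matches the
       GL_RGBA / GL_UNSIGNED_BYTE format passed to glReadPixels, avoiding a
       per-frame pixel-format conversion. */
    static int rgba_attribs[] = {
        GLX_RGBA,
        GLX_RED_SIZE,   8,
        GLX_GREEN_SIZE, 8,
        GLX_BLUE_SIZE,  8,
        GLX_ALPHA_SIZE, 8,
        GLX_DOUBLEBUFFER,
        None
    };

    XVisualInfo *choose_rgba_visual(Display *dpy, int screen)
    {
        return glXChooseVisual(dpy, screen, rgba_attribs);
    }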

This process of mapping and un-mapping a PBO each read cycle creates a lot of overhead. When dealing with a small window of only a few megapixels, the overhead is so great that no read back increase can be found. However, at larger resolutions, specifically 20 MP, an increase in read back rate can be found. When testing GoogleEarth at a full-screen resolution of 20 MP, an increase in read back of 30% was found: from the original 120 MP/s in the testing environment to a speed around 160 MP/s. To achieve this speedup, several things must be done to set up the process whenever a PBuffer is created.

1) Before any CUDA GL calls can be made, cudaGLSetGLDevice must be called; it is also good practice to call cudaSetDevice to make sure of which device is being used.

2) The PBO must be created and filled with data; this can be done using glGenBuffers, glBindBuffer, and glBufferData.

3) The PBO must be registered with CUDA using cudaGLRegisterBufferObject; upon exiting, cudaGLUnregisterBufferObject should be called.

Each time a frame is read, several steps are also required; a simplified code sketch of this read path is given after the list.

1) The PBO must be bound to the pack buffer and filled using glReadPixels.

2) The PBO must be unbound from the pack buffer and mapped using cudaGLMapBufferObject.

3) The data must be read from the PBO using cudaMemcpy.

4) The PBO must be unmapped using cudaGLUnmapBufferObject.
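The following is a minimal sketch of this setup and read path using the CUDA/OpenGL interoperability API named above; the buffer handle, frame dimensions, host destination, and RGBA pixel format are placeholders, and error checking is omitted. It is a simplified illustration, not the VirtualGL patch itself (which is in Appendix B).

    #include <GL/glew.h>
    #include <cuda_runtime.h>
    #include <cuda_gl_interop.h>

    /* One-time setup once the PBuffer (w x h, RGBA) exists. */
    GLuint create_readback_pbo(int w, int h)
    {
        GLuint pbo;

        cudaSetDevice(0);
        cudaGLSetGLDevice(0);                        /* setup step 1: bind CUDA to the GL device */

        glGenBuffers(1, &pbo);                       /* setup step 2: create and size the PBO    */
        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo);
        glBufferData(GL_PIXEL_PACK_BUFFER_ARB, w * h * 4, NULL, GL_STREAM_READ);
        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

        cudaGLRegisterBufferObject(pbo);             /* setup step 3: register the PBO with CUDA */
        return pbo;
    }

    /* Per-frame read back of the current read buffer into host memory. */
    void read_frame(GLuint pbo, int w, int h, void *host_bits)
    {
        void *devptr = NULL;

        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo);                      /* 1) pack into the PBO  */
        glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, NULL);        /*    (GPU-to-GPU copy)  */
        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);                        /* 2) unbind, then map   */

        cudaGLMapBufferObject(&devptr, pbo);
        cudaMemcpy(host_bits, devptr, w * h * 4, cudaMemcpyDeviceToHost); /* 3) copy to host       */
        cudaGLUnmapBufferObject(pbo);                                     /* 4) unmap for OpenGL   */
    }

When the PBuffer is destroyed, cudaGLUnregisterBufferObject and glDeleteBuffers release the PBO again.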

The entire process from the user's program to the X server is shown in figure 4. Although this process is more complex, it is faster because CUDA is better at utilizing the bandwidth between the GPU and the CPU. This section of code for VirtualGL can be found in Appendix B.

Once the frame was already being transferred through CUDA, an attempt was made at transferring and blitting only the changes required. A simple algorithm was used: if a change was found within a group of 512 pixels, those 512 pixels would be updated. The overhead associated with checking for changes and transferring small groups of pixels turned out to be too large to make this approach worth using, but it shows promise for solving the problem associated with blitting large frames to the X server. A sketch of such a change check is given below.
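As an illustration of the idea (not the exact code that was tested), a small CUDA kernel can compare the new frame against the previous one in 512-pixel groups and flag the groups that need to be transferred; the 32-bit RGBA pixel layout and the kernel name are assumptions.

    #include <cuda_runtime.h>

    #define GROUP_PIXELS 512

    /* Flag each 512-pixel group whose contents differ from the previous frame.
       Pixels are assumed to be 32-bit RGBA, one unsigned int per pixel. */
    __global__ void find_dirty_groups(const unsigned int *cur,
                                      const unsigned int *prev,
                                      int npixels, int *dirty)
    {
        int group = blockIdx.x;                     /* one CUDA block per pixel group */
        int start = group * GROUP_PIXELS;

        for (int i = start + threadIdx.x;
             i < start + GROUP_PIXELS && i < npixels;
             i += blockDim.x) {
            if (cur[i] != prev[i]) {
                dirty[group] = 1;                   /* mark the whole group for update */
                break;
            }
        }
    }

Only the groups marked dirty would then be copied to the host and blitted, which is where the per-group overhead described above comes from.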

    Fig. 4. Flow for CUDA Enhanced Read back

    V. VNC

When testing the performance of code changes in VirtualGL, a significant difference was found between the display wall at the Innovation Center and the test environment: the test environment was limited by the time it takes to read back, whereas the display wall, at a much higher resolution, is limited by the blit speed. In the blitting process VirtualGL makes a series of X calls to draw to the window and then calls XSync, which is what takes most of the time, so the current speed did not appear to be due to VirtualGL not being fast enough.

At that time the VNC server was also shown to be taking 100% of the time of one of the CPUs, and since the VNC server was not multi-threaded it could not take any more time than it was. According to a profiling of the VNC server done in the past, compression was taking the most time. Compression had been enabled because without it the client machines were often crashing. Attempting raw mode again was the next step to reduce CPU usage. By changing the read and write network buffers, raw mode was able to run for several minutes. In those several minutes it performed at about the same frame rate as before; however, the VNC server was taking much less CPU time and the network usage had gone up significantly (probably saturating during image transfer).

Since these two transfer types had the same performance, there was likely a balance between compression and network usage that would give better performance. After a few tests, better performance was achieved by setting the JPEG quality setting on the VNC clients to 3, which increased the frame rate to 2-3 frames per second, with noticeable image quality degradation. Now the CPU is not spending all of its time on compression, but the network is not at full capacity either. Knowing that such a performance change could be found in the VNC server, the next logical step was running it in parallel.

The VNC server being used handled each request sequentially. In this environment two main clients were accessing the server, each of the nodes driving eight monitors. Since requests are handled sequentially, the second client must always wait for the first to finish before its request is processed. Although the VNC server is no longer using all of the CPU, it is still using more than half of the time, meaning it is quite busy.


The VNC server being used was a program called xtightvnc, a modified version of the original Xvnc. There is, to date, no version of Xvnc that handles client requests in parallel. When Xvnc is launched it handles all of the X server setup required to create a second display (e.g. display :1). Another common VNC server is x11vnc, which connects to an already existing display and allows clients to view whatever is on that display; it is most commonly used by users who want remote access to display :0.

A simple test was performed in an attempt to relieve the CPU restriction of serving clients sequentially. Instead of having the client nodes connect to a single VNC server, the clients connected to individual x11vnc servers, one handling the top half of the screen and the other handling the bottom half. The performance increase was easily noticeable to the human eye. After this xtightvnc was no longer used; an X virtual frame buffer was used to create display :1, and an x11vnc server was set up for each client that wants to connect. The same 2-3 frames per second could be achieved without the quality degradation caused by lowering the compression quality.

There are many advantages to having multiple x11vnc servers rather than a single Xvnc server, such as each client not affecting the speed of another and being able to display only sections of the screen. There are also disadvantages: one worth noting is that clients are no longer synchronized in their updates and might run at different frame rates. The other cause for concern is that x11vnc can update at the same time as VirtualGL is updating the screen, causing tearing or other strange effects in the image.

VI. CONCLUSION

This project has found several factors affecting the performance of rendering on a monitor wall. A balance between network and CPU usage for the VNC server needs to be found for optimal performance; this is most easily done by having multiple VNC servers. It was also found that the read back rate can be increased noticeably by using CUDA to read pixels rather than reading through OpenGL.


    APPENDIX

    A. CUDA Memory Bandwidth Testing Program

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>

#define PINNED      0
#define CPUGPU      1
#define GPUGPU      2
#define CPUCPU      3
#define PINTOUNPIN  4
#define PINTOPIN    5
#define GPINTOUNPIN 6

int memtest(int type, long long int bytes, double *to, double *from);
void printtype(int type);

//#define SAMPLES 4
#define MEMCPY 0

int main(int argc, char *argv[])
{
    long long int bytes;
    double **toarray, **fromarray, *to, *from;
    long long int **bytesarray, *bytesr;
    int curtype;
    int i;
    double averageto, averagefrom;
    int SAMPLES;

    if (argc == 2) sscanf(argv[1], "%d", &SAMPLES);
    else SAMPLES = 3;
    cutilSafeCall(cudaSetDevice(0));

    toarray = (double **)malloc(6 * sizeof(double *));
    fromarray = (double **)malloc(6 * sizeof(double *));
    bytesarray = (long long int **)malloc(6 * sizeof(long long int *));
    for (i = 0; i < 6; i++) {
        toarray[i] = (double *)malloc(SAMPLES * sizeof(double));
        fromarray[i] = (double *)malloc(SAMPLES * sizeof(double));
        bytesarray[i] = (long long int *)malloc(SAMPLES * sizeof(long long int));
    }

    /* Collect SAMPLES timed transfers for each memory type. */
    for (curtype = 0; curtype < 6; curtype++) {
        i = 0;
        to = toarray[curtype];
        from = fromarray[curtype];
        bytesr = bytesarray[curtype];
        while (i < SAMPLES) {
            bytes = rand() % 153600000;          /* random size up to ~150 MB */
            if (bytes < 1000000) continue;
            if (!memtest(curtype, bytes, to + i, from + i)) {
                bytesr[i++] = bytes;
            }
        }
    }

    /* Convert the recorded times (ns) into average MB/s for each type. */
    printf("Type:\t\t\tTo (MB/s)\t\tFrom (MB/s)\n");
    for (curtype = 0; curtype < 6; curtype++) {
        to = toarray[curtype];


        from = fromarray[curtype];
        bytesr = bytesarray[curtype];
        averageto = 0;
        averagefrom = 0;
        for (i = 0; i < SAMPLES; i++) {
            averageto   += ((*(bytesr + i) / 1024.0 / 1024.0) / (*(to + i))   * (1000000000.0));
            averagefrom += ((*(bytesr + i) / 1024.0 / 1024.0) / (*(from + i)) * (1000000000.0));
        }
        averageto /= SAMPLES;
        averagefrom /= SAMPLES;
        printtype(curtype);
        printf("\t%7.2lf\t\t\t%7.2lf\n", averageto, averagefrom);
    }
    return 0;
}

void printtype(int type)
{
    switch (type) {
    case PINNED:
        printf("GPU to Pinned Memory: ");
        break;
    case CPUGPU:
        printf("CPU to GPU Memory: ");
        break;
    case GPUGPU:
        printf("GPU to GPU Memory: ");
        break;
    case CPUCPU:
        printf("CPU to CPU Memory: ");
        break;
    case PINTOUNPIN:
        printf("CPUPIN to CPU Memory: ");
        break;
    case PINTOPIN:
        printf("PIN to PIN Memory: ");
        break;
    case GPINTOUNPIN:
        printf("GPU to CPUPIN to CPU Memory: \n");
        break;
    }
    return;
}

/* Time a round-trip copy of 'bytes' bytes for the given memory type.
   The elapsed times (in nanoseconds) are returned through 'to' and 'from';
   a non-zero return value means the sample should be discarded. */
int memtest(int type, long long int bytes, double *to, double *from)
{
    int pinned;
    void *h, *d, *tmp;
    int i = 0;
    int j;
    double rate1, rate2;
    struct timespec start, stop;

    if (type == GPUGPU) {
        tmp = malloc(bytes);
        if (!tmp) return 1;
        d = NULL;
        cudaMalloc(&d, bytes);
        if (!d) {
            return 1;
        }
        h = NULL;
        cudaMalloc(&h, bytes);


        if (!h) {
            cudaFree(d);
            return 1;
        }
        /* Fill a host buffer with random data and copy it to the device. */
        for (i = 0; i < (bytes / sizeof(double)); i++) {
            *((double *)tmp + i) = rand();
        }
        cutilSafeCall(cudaMemcpy(h, tmp, bytes, cudaMemcpyHostToDevice));

        /* Time device-to-device copies in both directions (elapsed ns). */
        clock_gettime(CLOCK_MONOTONIC, &start);
        cutilSafeCall(cudaMemcpy(d, h, bytes, cudaMemcpyDeviceToDevice));
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate1 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        clock_gettime(CLOCK_MONOTONIC, &start);
        cutilSafeCall(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToDevice));
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate2 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        printf("%lld\t%.2lf\t%.2lf\n", bytes, rate1, rate2);
        free(tmp);
        cudaFree(d);
        cudaFree(h);
    }

    if ((type == CPUCPU) || (type == PINTOUNPIN) || (type == PINTOPIN)) {
        if ((type == PINTOUNPIN) || (type == PINTOPIN)) {
            cutilSafeCall(cudaMallocHost((void **)&h, bytes));   /* page-locked */
        } else {
            h = malloc(bytes);
        }
        if (!h) {
            return 1;
        }
        if (type == PINTOPIN) {
            cutilSafeCall(cudaMallocHost((void **)&d, bytes));
        } else {
            d = malloc(bytes);
        }
        if (!d) {
            if ((type == PINTOUNPIN) || (type == PINTOPIN)) cudaFreeHost(h);
            else free(h);
            return 1;
        }
        for (i = 0; i < (bytes / sizeof(double)); i++) {
            *((double *)h + i) = rand();
        }

        /* Time host-to-host copies in both directions (elapsed ns). */
        clock_gettime(CLOCK_MONOTONIC, &start);
#if MEMCPY
        memcpy(d, h, bytes);
#else
        cutilSafeCall(cudaMemcpy(d, h, bytes, cudaMemcpyHostToHost));
#endif
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate1 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        for (i = 0; i < (bytes / sizeof(double)); i++) {
            *((double *)d + i) = rand();
        }
        clock_gettime(CLOCK_MONOTONIC, &start);
#if MEMCPY
        memcpy(h, d, bytes);
#else
        cudaMemcpy(h, d, bytes, cudaMemcpyHostToHost);
#endif
        clock_gettime(CLOCK_MONOTONIC, &stop);


        rate2 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        printf("%lld\t%.2lf\t%.2lf\n", bytes, rate1, rate2);
        if ((type == PINTOUNPIN) || (type == PINTOPIN)) {
            cudaFreeHost(h);
        } else {
            free(h);
        }
        if (type == PINTOPIN) {
            cudaFreeHost(d);
        } else {
            free(d);
        }
    }

    if ((type == PINNED) || (type == CPUGPU)) {
        if (type == PINNED) {
            pinned = 1;
        } else {
            pinned = 0;
        }
        h = NULL;
        if (pinned == 0) {
            h = malloc(bytes);
        } else {
            cutilSafeCall(cudaMallocHost((void **)&h, bytes));   /* page-locked */
        }
        if (!h) {
            return 1;
        }
        d = NULL;
        cudaMalloc(&d, bytes);
        if (!d) {
            if (pinned == 0) free(h);
            else cudaFreeHost(h);
            return 1;
        }
        for (j = 0; j < (bytes / sizeof(double)); j++) {
            *((double *)h + j) = rand();
        }

        /* Time host-to-device and device-to-host copies (elapsed ns). */
        clock_gettime(CLOCK_MONOTONIC, &start);
        cutilSafeCall(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate1 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        clock_gettime(CLOCK_MONOTONIC, &start);
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate2 = (stop.tv_sec - start.tv_sec) * 1000000000.0 + (stop.tv_nsec - start.tv_nsec);

        if ((rate1 < 0) || (rate2 < 0)) return 1;
        printf("%lld\t%.2lf\t%.2lf\n", bytes, rate1, rate2);
        if (pinned == 0) {
            free(h);
        } else {
            cutilSafeCall(cudaFreeHost(h));
        }
        cudaFree(d);
    }

    *to = rate1;
    *from = rate2;
    return 0;
}


    B. Changes to pbwin.cpp in VirtualGL

void pbwin::readpixels(GLint x, GLint y, GLint w, GLint pitch, GLint h,
    GLenum format, int ps, GLubyte *bits, GLint buf, bool stereo)
{
    static int zfq = 0;
    struct timespec start, stop;
    GLint readbuf = GL_BACK;
    _glGetIntegerv(GL_READ_BUFFER, &readbuf);

    tempctx tc(_localdpy, EXISTING_DRAWABLE, GetCurrentDrawable());

    glReadBuffer(buf);
    glPushClientAttrib(GL_CLIENT_PIXEL_STORE_BIT);

    if (pitch % 8 == 0) glPixelStorei(GL_PACK_ALIGNMENT, 8);
    else if (pitch % 4 == 0) glPixelStorei(GL_PACK_ALIGNMENT, 4);
    else if (pitch % 2 == 0) glPixelStorei(GL_PACK_ALIGNMENT, 2);
    else if (pitch % 1 == 0) glPixelStorei(GL_PACK_ALIGNMENT, 1);

    int e = glGetError();
    while (e != GL_NO_ERROR) e = glGetError(); // Clear previous error

    // On a resolution change, resize the CUDA read-back buffers.
    if ((!first) && ((cw != w) || (ch != h))) {
        if (!cudafl) {
            cudafl = 1;
            i = 1;
            rrout.PRINT("[VGL] Resolution Change Attempting CUDA Acceleration (%d,%d) to (%d,%d)\n",
                cw, ch, w, h);
        }
        cudachangesize(w, h);
        cw = w;
        ch = h;
    }
    if (first) {
        cudafl = 1;
    }
    if (cudafl) {
        if (first) {
            // First frame: initialize GLEW, CUDA, and the PBO used for read back.
            static int go = 1;
            int i = 0;
            cw = w;
            ch = h;
            glewInit();
            cudastart(w, h);
            if (go) {
                go = 0;
                cutilSafeCall(cudaSetDevice(0));
                cutilSafeCall(cudaGLSetGLDevice(0));
            }
            while (cudamakebuffer(x,y/(1


    _prof_rb.startframe();
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, buffer);
    clock_gettime(CLOCK_MONOTONIC, &start);
    _glReadPixels(x, y, w, h, format, GL_UNSIGNED_BYTE, (GLvoid *)NULL);
    clock_gettime(CLOCK_MONOTONIC, &stop);
    rrout.PRINT("[VGL] glReadPixels took %lf ns\n",
        (double)stop.tv_nsec - start.tv_nsec);
    if ((((double)stop.tv_nsec - start.tv_nsec) > 1000000) || (((double)stop.tv_nsec - start.tv_nsec)


        mlib_ImageLookUp_Inp(image, (const void **)luts);
        mlib_ImageDelete(image);
    }
    else
    {
    #endif
    if (first) {
        first = false;
        if (fconfig.verbose) rrout.println("[VGL] Using software gamma correction (correction factor=%f)\n",
            (double)fconfig.gamma);
    }
    unsigned short *ptr1, *ptr2 = (unsigned short *)(&bits[pitch*h]);
    for (ptr1 = (unsigned short *)bits; ptr1


        snprintf(_autotestframe, 79, "__VGL_AUTOTESTFRAME%x=%d", (unsigned int)_win,
            _autotestframecount);
        putenv(_autotestframe);
    }

    glPopClientAttrib();
    tc.restore();
    glReadBuffer(readbuf);
}