Back to Stream Computing

Spec

  Node   CPU x2, GPU x2, 12GB DDR3, 80GB SSD       565 GFLOPS
  CPU    Intel Xeon E5645 2.40GHz (6 cores)          25 GFLOPS
  GPU    Tesla M2050 1.55GHz, 448 SPs, 3GB GDDR5    515 GFLOPS

System organization

  • Number of Nodes : 16
  • Number of CPUs : 192
  • Number of GPUs : 32
  • Total Performance (Double precision) : 8.496 + 9.040 = 17.536 TFLOPS
  • Total memory size : 384GB
  • Interconnection among nodes : Infiniband QDR

Photos

cluster0.jpg cluster1.jpg cluster2.jpg cluster3.jpg

How to use

Account for the cluster

You need an account on gpgpu.cs.tsukuba.ac.jp. (Ask Zhang-san to create the account for you.)

gpgpu is a gateway machine from which you can launch your MPI-based applications on the cluster.

SSH to gpgpu

gpgpu accepts your SSH connection. After you log in to gpgpu, you can connect to the cluster nodes.
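
For example, from your own machine (replace yama, the example user shown in the transcripts below, with your own account name);

ssh yama@gpgpu.cs.tsukuba.ac.jp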

From gpgpu, you can access the nodes using the following machine names;

gpgpue00
gpgpue01
gpgpue02
gpgpue03
gpgpue04
gpgpue05
gpgpue06
gpgpue07

Your home directory is shared with all the nodes via NIS.

Setup for MPI execution

PATH for MPI commands

At gpgpu, you need to add the following path to your PATH environment variable in order to use the MPI commands on that machine;

/usr/lib64/openmpi/1.4-gcc/bin/

For example, add the path above to the .bashrc in your home directory as follows;

PATH=$PATH:/usr/lib64/openmpi/1.4-gcc/bin/

After you add the path above to .bashrc, apply it to your current login environment using the source command, or just log in to gpgpu again.
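
To check that the path has taken effect, you can ask which mpicc; if the setup is correct, it should print a path under the directory above (the output below is the expected result, not a recorded transcript);

[yama@gpgpu ~]$ source ~/.bashrc
[yama@gpgpu ~]$ which mpicc
/usr/lib64/openmpi/1.4-gcc/bin/mpicc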

MANPATH for the MPI man pages

To read the man pages for MPI on gpgpu, export the path as follows;

export MANPATH=/usr/lib64/openmpi/1.4-gcc/man/:$MANPATH

Try to see the man pages for mpirun and MPI_Send.
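
For example;

[yama@gpgpu ~]$ man mpirun
[yama@gpgpu ~]$ man MPI_Send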

SSH key

The OpenMPI we use in the cluster invokes processes remotely using SSH. If you do not set up your SSH key for password-less login, you will need to type your password whenever an MPI process is created on any node in the cluster.

[yama@gpgpu ~]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/yama/.ssh/id_rsa): (<= just press Enter here)
Created directory '/home/yama/.ssh'.
Enter passphrase (empty for no passphrase): (<= just press Enter here)
Enter same passphrase again: (<= just press Enter here)
Your identification has been saved in /home/yama/.ssh/id_rsa.
Your public key has been saved in /home/yama/.ssh/id_rsa.pub.

Then, just execute the following command lines to register the key for SSH logins;

[yama@gpgpu ~]$ cat .ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
[yama@gpgpu ~]$ chmod 600 $HOME/.ssh/authorized_keys

To test whether the SSH key setting is correct, just log in to gpgpue00;

[yama@gpgpu ~]$ ssh gpgpue00
The authenticity of host 'gpgpue00 (192.168.2.1)' can't be established.
RSA key fingerprint is 34:d4:aa:8c:ef:49:5d:eb:f2:fc:37:ac:c2:4d:7d:19.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'gpgpue00,192.168.2.1' (RSA) to the list of known hosts.
Last login: Fri Feb 11 14:00:20 2011 from gpgpu-gw
[yama@gpgpu00 ~]$

SSH to all the nodes of the cluster from each of gpgpu00-07 once, so that the fingerprint information of all the nodes is recorded and later logins proceed automatically. (Can we avoid this step with some technique? If you know a good method, please suggest it.)
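
One possible way to avoid the interactive fingerprint prompts (a suggestion, not part of the original setup) is to relax host key checking for the cluster nodes in $HOME/.ssh/config, which is shared among the nodes (adjust the Host pattern if the nodes address each other by different names);

Host gpgpue*
    StrictHostKeyChecking no

Alternatively, a single loop run on gpgpu records all the host keys in the shared known_hosts in one go;

[yama@gpgpu ~]$ for n in 00 01 02 03 04 05 06 07; do ssh gpgpue$n hostname; done

Note that disabling strict host key checking removes a safety check, so it is only reasonable inside the trusted cluster network.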

MPI execution

Now you can use the environment for running MPI programs.

Let us try the ping-pong sample program included in the OpenMPI package.

First, download pingpong.c (using the wget command, or send the file to gpgpu) and save it in any directory on gpgpu.
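
If you just want to check that compilation and execution work, a minimal ping-pong along the following lines can stand in for it (this is only a sketch, not the distributed pingpong.c; the real sample sweeps many message sizes and also measures asynchronous and bi-directional transfers);

/* Minimal MPI ping-pong sketch: rank 0 sends a buffer to rank 1,
 * rank 1 sends it back, and rank 0 reports the round-trip time. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char host[MPI_MAX_PROCESSOR_NAME];
    char buf[1024];
    int rank, size, len;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    memset(buf, 0, sizeof(buf));

    printf("I am %s.\n", host);
    printf("Hello from %d of %d\n", rank, size);

    if (size >= 2) {
        if (rank == 0) {
            /* send, wait for the echo, and time the round trip */
            t0 = MPI_Wtime();
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            t1 = MPI_Wtime();
            printf("%d bytes round trip: %.1f usec\n",
                   (int)sizeof(buf), (t1 - t0) * 1e6);
        } else if (rank == 1) {
            /* echo the buffer back to rank 0 */
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}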

Compile the program using the mpicc command;

mpicc pingpong.c -o pingpong

Now you can find an executable file named "pingpong" in the directory.

Now it is time to execute it on the cluster nodes using mpirun.

You need to make a "machinefile" to specify the machines over which your program is distributed.

Edit the machinefile, adding the nodes mentioned above, such as gpgpue00-07.

Here, we assume that the machinefile is named "machinefile" and contains the following two machines;

gpgpue00
gpgpue01
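
If you want to start more than one process per node, OpenMPI's machinefile format also accepts a slots entry; for example, the following (hypothetical) machinefile would allow up to 12 processes on each node;

gpgpue00 slots=12
gpgpue01 slots=12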

Try to execute the ping-pong program using mpirun as follows:

mpirun -prefix /opt/openmpi/1.4.2/ -machinefile machinefile -np 2 -nolocal pingpong

/opt/openmpi/1.4.2/ is the path to OpenMPI on the nodes.

By default, the processes use the InfiniBand network for MPI communication. You do not need to configure anything for the connection.
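
If you ever need to fall back to plain TCP (for example, while debugging a problem you suspect is InfiniBand related), OpenMPI can be told which transport to use through an MCA parameter; the command line below is an assumption based on standard OpenMPI options, not part of the original instructions;

mpirun -prefix /opt/openmpi/1.4.2/ -machinefile machinefile -np 2 -nolocal --mca btl tcp,self pingpong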

The output from the command line above is shown below;

I am gpgpu00.
I am gpgpu01.
Hello from 0 of 2
Hello from 1 of 2
Timer accuracy of ~1.907349 usecs

       8 bytes took        51 usec (   0.314 MB/sec)
      16 bytes took         6 usec (   5.369 MB/sec)
      32 bytes took         9 usec (   7.064 MB/sec)
      64 bytes took         6 usec (  21.475 MB/sec)
     128 bytes took       142 usec (   1.805 MB/sec)
     256 bytes took        13 usec (  39.768 MB/sec)
     512 bytes took        10 usec ( 102.261 MB/sec)
    1024 bytes took        13 usec ( 159.073 MB/sec)
    2048 bytes took        24 usec ( 170.098 MB/sec)
    4096 bytes took        21 usec ( 390.452 MB/sec)
    8192 bytes took        28 usec ( 587.346 MB/sec)
   16384 bytes took       324 usec ( 101.132 MB/sec)
   32768 bytes took       185 usec ( 354.224 MB/sec)
   65536 bytes took       242 usec ( 541.631 MB/sec)
  131072 bytes took       459 usec ( 571.175 MB/sec)
  262144 bytes took       583 usec ( 899.028 MB/sec)
  524288 bytes took      1082 usec ( 969.160 MB/sec)
 1048576 bytes took      1930 usec (1086.608 MB/sec)
 2097152 bytes took      3868 usec (1084.398 MB/sec)
 4194304 bytes took      7496 usec (1119.096 MB/sec)
 8388608 bytes took     14646 usec (1145.511 MB/sec)
 16777216 bytes took     30754 usec (1091.056 MB/sec)
 33554432 bytes took     61213 usec (1096.317 MB/sec)
 67108864 bytes took    120136 usec (1117.215 MB/sec)
 134217728 bytes took    239794 usec (1119.442 MB/sec)
 268435456 bytes took    476880 usec (1125.799 MB/sec)

  Asynchronous ping-pong

       8 bytes took        13 usec (   1.220 MB/sec)
      16 bytes took         6 usec (   5.369 MB/sec)
      32 bytes took         6 usec (  10.737 MB/sec)
      64 bytes took         5 usec (  25.565 MB/sec)
     128 bytes took        10 usec (  25.565 MB/sec)
     256 bytes took         8 usec (  65.075 MB/sec)
     512 bytes took        10 usec ( 102.261 MB/sec)
    1024 bytes took        12 usec ( 171.799 MB/sec)
    2048 bytes took        20 usec ( 204.522 MB/sec)
    4096 bytes took        21 usec ( 390.452 MB/sec)
    8192 bytes took        85 usec ( 193.032 MB/sec)
   16384 bytes took        61 usec ( 536.871 MB/sec)
   32768 bytes took        71 usec ( 925.515 MB/sec)
   65536 bytes took       125 usec (1049.152 MB/sec)
  131072 bytes took       232 usec (1128.862 MB/sec)
  262144 bytes took       462 usec (1134.687 MB/sec)
  524288 bytes took       861 usec (1217.958 MB/sec)
 1048576 bytes took      1750 usec (1198.378 MB/sec) 

  Bi-directional asynchronous ping-pong  

       8 bytes took         5 usec (   3.196 MB/sec)
      16 bytes took         4 usec (   7.895 MB/sec)
      32 bytes took         4 usec (  16.777 MB/sec)
      64 bytes took         4 usec (  31.581 MB/sec)
     128 bytes took         9 usec (  29.020 MB/sec)
     256 bytes took         8 usec (  65.075 MB/sec)
     512 bytes took         9 usec ( 113.025 MB/sec)
    1024 bytes took        11 usec ( 186.738 MB/sec)
    2048 bytes took        18 usec ( 226.051 MB/sec)
    4096 bytes took        19 usec ( 429.497 MB/sec)
    8192 bytes took        32 usec ( 509.033 MB/sec)
   16384 bytes took       286 usec ( 114.628 MB/sec)
   32768 bytes took       129 usec ( 508.092 MB/sec)
  65536 bytes took       227 usec ( 577.475 MB/sec)
  131072 bytes took       364 usec ( 720.047 MB/sec)
  262144 bytes took       521 usec (1006.418 MB/sec)
  524288 bytes took       967 usec (1084.331 MB/sec)
 1048576 bytes took      2062 usec (1017.125 MB/sec)

 Max rate = 1217.958048 MB/sec  Min latency = 1.907349 usec

OpenMP

gcc on each node supports OpenMP.

Here, we try an OpenMP program named "omptest".

Please download omptest.c and compile it using the command line below;

gcc -fopenmp omptest.c -o omptest

Then just run the executable directly. It will be parallelized over the available CPU cores (4 threads in the run below).
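
For reference, a minimal OpenMP program that produces output of this shape could look like the following (a sketch inferred from the output below, not necessarily identical to the distributed omptest.c);

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int total = 0;

    /* Each thread greets, then counts its share of 100 loop iterations
     * (25 each with 4 threads); the reduction adds the per-thread sums. */
    #pragma omp parallel reduction(+:total)
    {
        int id = omp_get_thread_num();
        int local = 0;
        int i;

        printf("Hello World from %d\n", id);

        #pragma omp for
        for (i = 0; i < 100; i++)
            local += 1;

        printf("proc%d: sum = %d\n", id, local);
        total += local;
    }

    printf("sum = %d\n", total);
    return 0;
}

A sample run on a node looks like this;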

[yama@gpgpu00 OpenMPTest]$ ./omptest
Hello World from 0
Hello World from 2
Hello World from 1
Hello World from 3
proc3: sum = 25
proc0: sum = 25
proc1: sum = 25
proc2: sum = 25
sum = 100

Enjoy!!

Written by Shinichi Yamagiwa


Attach file: cluster0.jpg, cluster1.jpg, cluster2.jpg, cluster3.jpg, GPU_Cluster1.JPG, GPU_Cluster2.JPG, GPU_Cluster3.JPG, pingpong.c, omptest.c
