MSc. Miriel Martín Mesa, DIC, UCLV
High Performance Cluster in the UCLV
First steps
The idea
Installing a High Performance Cluster in the UCLV, using professional servers with an open-source operating system
Why?
Current research requires a large amount of computational resources that cannot be provided by a single computer.
The need to run several experiments without having to wait for the current run to finish before starting the next.
The possibility of having an electrical backup that allows running jobs that take several days to finish.
Current Hardware
7 Dell R410 nodes, each with: 2 Intel processors (6 cores each), 12 GB RAM, 250 GB hard drive, 2 Gbps NICs
10 Dell 1955 blade nodes, each with: 2 Intel processors (2 cores each), 12 GB RAM, 36 GB hard drive
Current Hardware
In total: 17 nodes, 28 processors, 132 cores, 204 GB RAM, 1.3 TFLOPS (theoretical peak)
Cluster design
Beowulf design
Basic Software
• OS: Debian 7
• Resource manager: Torque (PBS)
• Scheduler: Maui
• Central user authentication: NIS server
Cluster installation (Master and nodes)
• PXE (Preboot eXecution Environment)
• DHCP
• TFTP
• HTTP server
• DNS server (BIND)
• Preseed script (answers to the installer's questions)
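The slides do not show the DHCP/TFTP side of this boot chain; the following is only a minimal sketch, assuming isc-dhcp-server on the master and an illustrative 10.0.0.0/24 cluster network (the subnet, addresses, and file names are assumptions):

# /etc/dhcp/dhcpd.conf (illustrative)
subnet 10.0.0.0 netmask 255.255.255.0 {
  range 10.0.0.100 10.0.0.200;           # address pool for the compute nodes
  option routers 10.0.0.1;               # master node as gateway
  option domain-name-servers 10.0.0.1;   # BIND running on the master
  next-server 10.0.0.1;                  # TFTP server
  filename "pxelinux.0";                 # PXE boot loader served via TFTP
}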
Preseed code
d-i mirror/protocol string http
d-i mirror/country string manual
d-i mirror/http/hostname string master.cluster.uclv.edu.cu
d-i mirror/http/directory string /debian
d-i mirror/http/proxy string
d-i mirror/suite string wheezy
d-i partman-auto/disk string /dev/sda
d-i partman-auto/method string regular
d-i partman-auto/choose_recipe select atomic
d-i partman-auto/purge_regular_from_device boolean true
d-i partman-regular/confirm boolean true
d-i partman/confirm_write_new_label boolean true
d-i partman/choose_partition select Finish partitioning and write changes to disk
d-i partman/confirm boolean true
# If the system has free space you can choose to only partition that space.
tasksel tasksel/first multiselect minimal
d-i pkgsel/include string openssh-server puppet
d-i preseed/late_command string sed -i 's/no/yes/g' /target/etc/default/puppet
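To hand this preseed file to the installer at boot, the PXE boot entry passes it through kernel parameters; a hedged pxelinux.cfg/default sketch, assuming the file is published as preseed.cfg on the master's HTTP server (the file name and installer paths are assumptions):

# pxelinux.cfg/default (illustrative)
default auto-install
label auto-install
  kernel debian-installer/amd64/linux
  append initrd=debian-installer/amd64/initrd.gz auto=true priority=critical url=http://master.cluster.uclv.edu.cu/preseed.cfg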
Cluster Management
Puppet: package management and configuration of the master server and the nodes.
Cluster Management
Module: commons
class packages-commons {
  $packages_commons = ["csh", "flex", "byacc", "vim", "tcsh", "lsb", "lsb-core"]
  package { $packages_commons:
    ensure => installed,
  }
}
Cluster Management
Module: MPICH
class mpich ($mpich_version) {
  file { mpich:
    path   => "${mpich_path}",
    owner  => root,
    mode   => 775,
    ensure => directory,
  }
  exec { "mpich_configure":
    cwd     => "${mpich_source}-${mpich_version}/",
    command => "nice -19 sh configure ${mpich_prefix} ${mpich_with_torque}",
    onlyif  => "test ! -e ${mpich_source}-${mpich_version}/config.log",
  }
  ...
}
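A hedged example of how this parameterized class might be assigned to a node in site.pp; the node name is an assumption, and the version simply matches the mpich/3.0.4 module loaded in the job script shown later:

# site.pp (illustrative)
node 'node01.cluster.uclv.edu.cu' {
  class { 'mpich':
    mpich_version => '3.0.4',   # matches "module load mpich/3.0.4" in the job example
  }
}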
cron { update_ntpdate:
  command => "/usr/sbin/ntpdate ",
  user    => root,
  minute  => 0,
  hour    => '*/1',
}

service { cron:
  ensure => running,
  enable => true,
}
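Because the preseed's late_command enables the agent in /etc/default/puppet, the nodes apply this configuration automatically; a manual test run against the master would look roughly like this (the master name follows the hostname used in the preseed):

puppet agent --test --server master.cluster.uclv.edu.cu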
Cluster Management
Monitoring tools
Ganglia
Provides real-time monitoring of the cluster and its execution environment
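On the Ganglia side, each node runs the gmond agent; the following is only a minimal sketch of its configuration, assuming the default multicast channel (the cluster name and owner are assumptions):

# /etc/ganglia/gmond.conf (illustrative fragment)
cluster {
  name  = "HPC UCLV"
  owner = "UCLV"
}
udp_send_channel {
  mcast_join = 239.2.11.71   # default Ganglia multicast group
  port       = 8649
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port       = 8649
  bind       = 239.2.11.71
}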
Monitoring tools
Icinga
Monitors network resources, sends notifications on errors, generates performance data for reporting, and reports the status of the monitored resources
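Icinga (in its classic, Nagios-style configuration) describes each monitored resource as an object; a hedged sketch for one compute node, where the host name, address, and templates are assumptions:

# /etc/icinga/objects/node01.cfg (illustrative)
define host {
  use        generic-host        ; template assumed to be defined elsewhere
  host_name  node01
  alias      Compute node 01
  address    10.0.0.101
}

define service {
  use                  generic-service
  host_name            node01
  service_description  PING
  check_command        check_ping!100.0,20%!500.0,60%
}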
System Access
Web page
[Slides showing screenshots of the cluster's web access page]
Cluster applications
Example
#!/bin/bash
#PBS -N example1
#PBS -l nodes=2:ppn=4
#PBS -l walltime=01:20:00
#PBS -q default
#PBS -m ae
#PBS -M [email protected]

cd $PBS_O_WORKDIR
############################
module load mpich/3.0.4
mpirun ./application
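The script is submitted and tracked with the standard Torque commands; the script file name below is an assumption:

qsub example1.pbs    # submit the job; prints the job id
qstat -u $USER       # list your jobs and their states
qdel <job_id>        # cancel a job if needed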
Cluster queues
Queue    Nodes  Access       Cores  Memory (GB)  Jobs/user  Max time (hours)  Priority
default
small    1      Blade nodes  4      8            4          12                10
medium   1-3    Any          24     12           3          36                20
long     1-4    Any          24     15           2          168               30
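The slides do not show how these queues are defined; as a hedged illustration, the "long" queue could be created in Torque with qmgr and given its priority in Maui roughly as below (the mapping of the table's columns to nodect, max_user_queuable, and CLASSCFG is an assumption):

# Torque: define the "long" queue (illustrative)
qmgr -c "create queue long"
qmgr -c "set queue long queue_type = Execution"
qmgr -c "set queue long resources_max.walltime = 168:00:00"   # max time: 168 hours
qmgr -c "set queue long resources_max.nodect = 4"             # up to 4 nodes
qmgr -c "set queue long max_user_queuable = 2"                # jobs per user
qmgr -c "set queue long enabled = true"
qmgr -c "set queue long started = true"

# Maui: queue priority (maui.cfg)
CLASSCFG[long] PRIORITY=30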
To do
• Implement a system of user quotas
• Add external storage
• Continue installing the applications demanded by users
We always need to do more
Thank you
Muchas Gracias