Preparing for the Future of Computing
Pieter Hijma
Friday 15 April 2016
Coffee & Data on Infrastructures
Broaden the view
Anticipate risks
Become aware of current issues
2/23
Message
In the coming years, software will have to adapt to hardware more than previously: Software Crisis
3/23
My background
• PhD on programming many-cores
• High-performance computing: front-runner
Programming Many-Cores on Multiple Levels of Abstraction
Pieter Hijma
4/23
Look into the Past
[Figure: Moore's law. Number of transistors (log scale, 10^3 to 10^10) and clock speed in MHz (log scale, 0.1 to 10000) against year, 1970 to 2015. Processors shown: 8008, 8080, 8086, 80286, 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4, Core 2, Core i7 (Nehalem), Core i7 (Sandy), Core i7 (Haswell). Annotations added in successive builds of the slide: single-core era, lucky time, multi-core era.]
5/23
Processor types
• Single-core
  • Optimized for latency
• Multi-core
  • Still optimized for latency, but just more than one
• Many-core
  • Optimized for throughput
  • High performance/Watt
[Diagram: processor cores, each consisting of ALUs, Control and Cache]
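The difference between a latency-optimized and a throughput-optimized design can be illustrated with a toy model. The core counts and per-item times below are hypothetical, chosen only for illustration, and the model assumes fully independent work items with no memory or scheduling costs.

```python
import math

def batch_time_ns(n_items: int, cores: int, ns_per_item: float) -> float:
    """Time to finish n_items independent work items when every core
    processes one item at a time (toy model)."""
    rounds = math.ceil(n_items / cores)
    return rounds * ns_per_item

# Hypothetical processors, not measurements of real hardware.
MULTI_CORE = {"cores": 4, "ns_per_item": 1.0}    # few fast cores (latency)
MANY_CORE = {"cores": 512, "ns_per_item": 4.0}   # many slower cores (throughput)

for n in (1, 1_000_000):
    t_multi = batch_time_ns(n, **MULTI_CORE)
    t_many = batch_time_ns(n, **MANY_CORE)
    print(f"{n:>9} items: multi-core {t_multi:>8.0f} ns, many-core {t_many:>8.0f} ns")
```

Under these assumptions a single item finishes sooner on the multi-core, while a large batch finishes sooner on the many-core.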
6/23
Performance per Watt
• 1972 - 2007:
  • 10,000 × higher performance
  • 300 × higher efficiency
• 2005 - 2015:
  • 250 × higher performance
  • 10 × higher efficiency
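Since efficiency is performance per watt, dividing the performance gain by the efficiency gain gives the implied growth in power consumption. The sketch below applies that relation to the factors on this slide; the function name is hypothetical.

```python
def power_growth(performance_gain: float, efficiency_gain: float) -> float:
    """Efficiency = performance / power, so power = performance / efficiency.
    The same relation holds for the growth factors."""
    return performance_gain / efficiency_gain

# Growth factors as stated on the slide: (performance, efficiency).
periods = {
    "1972 - 2007": (10_000, 300),
    "2005 - 2015": (250, 10),
}

for period, (perf, eff) in periods.items():
    print(f"{period}: power grew roughly {power_growth(perf, eff):.0f} x")
```

Read this way, power consumption grew by a factor of about 33 in the first period and 25 in the second, much shorter, period.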
7/23
Performance per Watt
[Figure: Performance and power efficiency of the #1 system in the TOP500, 2005 to 2015. Series: Performance (TFLOPS) and Efficiency (MFLOPS/W). Vertical axis from 0 to 35000.]
8/23
Many-core processors
Features
• throughput oriented
• fast evolution of the architecture
• architectural features for high performance
Difficult to program, especially for high performance
9/23
Walls
• memory wall
• energy wall
Result
hardware makes no compromises for the interface to programmers → difficult to program →
• programming wall
10/23
Recap
To deal with energy problems, hardware will be:
• highly parallel
• throughput oriented
• exposing architectural details for performance
• difficult to program
Result
More responsibility for software developers
11/23
Soft-/hardware
Traditional view
• software is soft, we can change it easily
• hardware is hard, we cannot change it
New view
• software is expensive, big, long-lived and therefore hard, changing it is difficult
• hardware is inexpensive, short-lived and therefore soft, changing hardware is easy
12/23
Soft-/hardware
Traditional view
• software receives less attention
• hardware receives much attention
• compiler can solve your performance problems
New view
• software should gain more recognition
• to adapt to future hardware, software has to become soft again
• compiler cannot help you as much
13/23
Examples
Parallel Ocean Program
• written in the ’90s
• different context: memory cheap, compute expensive
Linpack
• we still evaluate supercomputers with the same software
• not representative anymore
14/23
How to prepare for this?
Abstractions
• Programmer uses high-level abstractions
• Compiler optimizes and translates to lower levels for the computer
• Hardware extracts (limited) parallelism
This works reasonably well with:
• a stable interface to hardware
• limited parallelism on a processor
15/23
How to prepare for this?
In the many-core era?
• Interface to hardware not stable
• Highly parallel, parallelism is explicit
• Compilers are limited and cannot adapt fast enough
Abstractions
• Also hide the architectural details necessary for performance/efficiency
16/23
Programming
A program is an algorithm mapped to hardware
Program
Algorithm
Mapping
Hardware
Solution
Incorporate hardware descriptions in the programming model
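One way to read "a program is an algorithm mapped to hardware" is sketched below: the algorithm (an element-wise addition) is written once, and a separate hardware description determines how the work is divided. The class HardwareDescription, its fields and the two example descriptions are hypothetical illustrations, not the programming model from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class HardwareDescription:
    """Hypothetical, minimal description of a target device."""
    name: str
    parallel_units: int  # number of units that can work independently

def add(a: float, b: float) -> float:
    """The algorithm: what is computed for one element."""
    return a + b

def map_to_hardware(op: Callable[[float, float], float],
                    xs: List[float], ys: List[float],
                    hw: HardwareDescription) -> List[float]:
    """The mapping: give every parallel unit one contiguous chunk of the
    index range. The chunks run one after another here; on real hardware
    each chunk would be executed by a different unit."""
    n = len(xs)
    chunk = -(-n // hw.parallel_units)  # ceiling division
    out = [0.0] * n
    for unit in range(hw.parallel_units):
        for i in range(unit * chunk, min((unit + 1) * chunk, n)):
            out[i] = op(xs[i], ys[i])
    return out

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [10.0, 20.0, 30.0, 40.0, 50.0]
for hw in (HardwareDescription("cpu-like", 2), HardwareDescription("gpu-like", 4)):
    print(hw.name, map_to_hardware(add, xs, ys, hw))
```

The result is the same for both descriptions; only the division of work changes, which is the part a hardware description is meant to control.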
17/23
Hierarchy of hardware descriptions
[Diagram: hierarchy of hardware descriptions. Nodes: perfect; mic, gpu; xeon phi; nvidia, amd; fermi, kepler; gtx480, gtx680. Axis labels: control/performance and portability.]
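Such a hierarchy can be represented as a tree of descriptions. The sketch below is an assumption-laden illustration: the parent/child edges are inferred from the layout of the diagram, and the portability rule in runs_on (a program written for a level runs on every device below it) is an assumed reading of the portability axis.

```python
from typing import Dict, List

# Parent of each hardware description; "perfect" is the root.
# The edges are assumed from the layout of the diagram.
PARENT: Dict[str, str] = {
    "mic": "perfect",
    "gpu": "perfect",
    "xeon phi": "mic",
    "nvidia": "gpu",
    "amd": "gpu",
    "fermi": "nvidia",
    "kepler": "nvidia",
    "gtx480": "fermi",
    "gtx680": "kepler",
}

def levels(description: str) -> List[str]:
    """Return the chain from a description up to the root."""
    chain = [description]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def runs_on(program_level: str, device: str) -> bool:
    """Assumed rule: a program written for a level is portable to every
    device that lies below that level in the tree."""
    return program_level in levels(device)

print(levels("gtx480"))
print(runs_on("nvidia", "gtx680"), runs_on("fermi", "gtx680"))
```

Writing a program at a level closer to the root keeps it portable to more devices; writing it closer to a specific device gives more control.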
18/23
Role of the compiler: advisory
Compilers
• conservative
• no application knowledge
Programmers
• application knowledge
• can oversee correctness problems better
19/23
Optimization process recorded
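[Diagram: the hierarchy of hardware descriptions (perfect; mic, gpu; xeon phi; nvidia, amd; fermi, kepler; gtx480, gtx680), annotated with the performance recorded for each version of the program:]
89 GFLOPS
v1: 100 GFLOPS
v2: 92 GFLOPS
v3: 205 GFLOPS
205 GFLOPS
v1: 494 GFLOPS
89 GFLOPS
Feedback
Using 1/8 blocks per smp. Reduce the amount of shared memory used by storing/loading shared memory in phases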
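The feedback message concerns how many thread blocks fit on one streaming multiprocessor (smp). A minimal sketch of that reasoning follows, assuming that only shared memory and a fixed block limit constrain residency. The limits of 48 KB shared memory and 8 resident blocks per smp are assumed values for a Fermi-class GPU, and the per-block usage figures are hypothetical.

```python
SHARED_MEM_PER_SMP = 48 * 1024  # bytes per smp (assumed)
MAX_BLOCKS_PER_SMP = 8          # resident block limit per smp (assumed)

def resident_blocks(shared_mem_per_block: int) -> int:
    """Number of blocks that fit on one smp when shared memory is the
    only limiting resource besides the fixed block limit."""
    if shared_mem_per_block == 0:
        return MAX_BLOCKS_PER_SMP
    by_memory = SHARED_MEM_PER_SMP // shared_mem_per_block
    return min(MAX_BLOCKS_PER_SMP, by_memory)

# Hypothetical kernels: one keeps a large tile in shared memory at once,
# the other loads the same data in smaller phases.
for usage in (40 * 1024, 6 * 1024):
    blocks = resident_blocks(usage)
    print(f"{usage // 1024} KB per block -> {blocks}/{MAX_BLOCKS_PER_SMP} blocks per smp")
```

Under these assumptions, reducing shared memory per block by loading data in phases raises the number of resident blocks from 1 to 8.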
20/23
Making software soft again
Marver
Semi-automatically transform legacy programs
• Ben van Werkhoven
[Figure: call graph of the Parallel Ocean Program (POP), starting at root and spanning several hundred routines, including POP_Initialize, step, baroclinic_driver, barotropic_driver, output_driver and exit_POP.]
21/23
Overview of people involved
Henri Bal: High Performance Distributed Computing
Cees de Laat: System and Network Engineering
Ana Varbanescu: Performance Prediction
Ben van Werkhoven: Applications
Ismail El-Helw: Programming Models
Alessio Sclocco: Radio Astronomy
Me: Programming Models
Merijn Verstraaten: Graph Processing
Souley Madougou: Performance Prediction
Clemens Grelck: High-level Programming
Raphael Poss: Hard-/software Interface
Catalin Ciobanu: Reconfigurable Computing
22/23
Conclusion
In the coming years, software will have to adapt to hardware more than previously: Software Crisis
23/23