October 28, 2003 — 11:21 PM

TenCon Keynote: Srinidhi Varadarajan and the G5 Supercomputer

Dr. Srinidhi Varadarajan is the director of the Terascale Computing Facility at Virginia Tech, and he is taking us through how the facility was created: the why, the goals, the hardware and facilities, the software it's based on, the performance results and the research that's going to take place.

This all began early in 2003, bringing all the high performance computing people together. They wanted to build a world class program, and they needed a big ticket facility to go with it. Instead of having a grant/proposal structure for their students, they wanted to build a facility for them to use. They wanted to tie them into computational grids all over the nation. They want to treat supercomputers like electrical generation stations, using them in concert with visualization facilities and data storage centers to make a much more intelligent grid.

Virginia Tech is part of the National Lambda Rail (NLR) network, a huge fibre-optic network (15,000+ miles), which gives them a major network pipe to use with the new machine.

The Terascale Computing Facility was also a political success within the university, allowing many of the different departments to discuss and come together and share resources. They're all going in the same direction now.

Derrick Story: When did this all come together?

Dr. Varadarajan: This started in March of 2003. Within a month, they had financing. "We hope to continue on this."

The facility was built for dual use: experimental work and real-world work.

They wanted high performance architectures, 64-bit and up only. People don't pay enough attention to communication. Clusters use gigabit Ethernet to talk to each other; supercomputers use tightly coupled cores. That's what separates them. They wanted NLR, Internet and Internet2 connectivity. The machine has become operational and will be ready for production runs this fall.

Derrick Story: 64 bit is essential, how did you make the jump to the G5 and the Macintosh? You weren't using this before? Why here?

Dr. V: We're coming to that, please be patient.

Usage Goals:
Provide easy access for new investigators and exploratory research.
Support collaborative multisite research activities.
Support on-demand access to computational cycles from external research partners.

They don't want to shut people out because they don't have a grant. This is about getting things together. The System can be strictly partitioned or it can be loosely handled.

The Future:
Computational Sciences and Engineering is a long term initiative. The current facility will be followed by another one in 2006.

June 23rd: Apple announced the G5
June 26th: VT contacted Apple
Sept 5-11: G5s arrive
Sept 23rd: Facility began preliminary ops
Oct 1 - Nov: Performance optimization
Mid Nov.: Facility available for initial applications. Any user with operational HPC (MPI) codes can access the facility at this point.
Jan 1st: Facility available for full production use.

That's an incredible timeline, folks.

Here's the Hardware

Choosing the right architecture: limited budget and price/performance was the main consideration.

The total cost of the asset, including systems, memory, storage, primary and secondary communications fabrics and cables, is $5.2 million. The facilities upgrade was another $2 million: $1 million for the upgrades, $1 million for the UPS and generators. Arguably the cheapest world class supercomputer.

Definitely the most powerful student machine. "Just that it runs is a big deal in itself."

Architecture Options:

Dell could deliver in mid-August, all Itanium II, but it fell through.
After that, AMD and IBM with Opteron systems, but the prices didn't work.
HP, same problem. All of these came in at $9-10 million.

IBM couldn't deliver the 970 before January. But Apple could.

Don't design a machine for 18 months and then build it. Buy it and build it Right Then. Do it in 3 months.

The time was just right for the G5.

1100 nodes, each a dual-processor 2GHz Apple G5. Each node has 4GB of main memory and 160GB of Serial ATA storage, for 176TB of total secondary storage. 4 head nodes for compilations and job startup. 1 management node.
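The totals follow directly from the per-node figures. A quick check, as a sketch that assumes decimal units (1TB = 1000GB) and uses variable names of my own choosing:

```python
# Aggregate capacity from the per-node figures quoted in the talk.
# Assumption: decimal units (1 TB = 1000 GB); the names are illustrative.
NODES = 1100
MEMORY_PER_NODE_GB = 4
DISK_PER_NODE_GB = 160

total_memory_tb = NODES * MEMORY_PER_NODE_GB / 1000
total_disk_tb = NODES * DISK_PER_NODE_GB / 1000

print(f"Total main memory: {total_memory_tb:.1f} TB")
print(f"Total disk:        {total_disk_tb:.0f} TB")
```

The disk total matches the 176TB quoted; the memory total is not a number from the talk, just the same multiplication applied to the 4GB per node.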

"I came to the Mac by reading the kernel manual first." Dr. V did not use a mac before any of this.

Each G5 has 2 double precision FPUs. Each unit can complete 1 fused multiply-add operation per cycle, which counts as 2 flops. This is the most common op in numerical computations. Thus, each processor can deliver 2 DP units * 2 flops/cycle * 2GHz = 8GFlops. That's more than one Cray X1 Node. In a desktop. (shit.)

Each dual G5 can deliver a peak of 16GFlops of double precision performance leading to an Rpeak of 17.6TF.
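The peak numbers are simple multiplication, and it helps to see every factor written out. A small sketch using only the figures from the talk (the constant names are mine):

```python
# Theoretical peak (Rpeak) from the per-processor figures in the talk.
FPUS_PER_CPU = 2      # double precision FPUs per G5
FLOPS_PER_FMA = 2     # a fused multiply-add counts as 2 flops
CLOCK_GHZ = 2.0       # 2GHz G5
CPUS_PER_NODE = 2     # dual-processor nodes
NODES = 1100

gflops_per_cpu = FPUS_PER_CPU * FLOPS_PER_FMA * CLOCK_GHZ
gflops_per_node = gflops_per_cpu * CPUS_PER_NODE
rpeak_tflops = gflops_per_node * NODES / 1000

print(f"Per processor: {gflops_per_cpu:.0f} GFlops")
print(f"Per node:      {gflops_per_node:.0f} GFlops")
print(f"Rpeak:         {rpeak_tflops:.1f} TFlops")
```

Peak assumes every FPU retires a fused multiply-add on every cycle, which real code never quite manages; that's why the LINPACK number later is lower.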

Primary Comm Architecture.
Based on Infiniband technology: a switched network. Each node connects into the network at 20Gbps full duplex. There are 24 96-port switches organized in a fat tree topology. Mellanox designed the switches and cards they're using. Every node has a connection to every other node. It can support 150,000 connections per node. It's a very nice piece of hardware. Less than 10µs latency.

18 leaf switches, each using 64 ports to connect to nodes. 6 spine switches act as a backplane, and 32 ports per leaf switch connect up to the spine switches. 5-6 ports per leaf switch are connected into each spine switch. Total switching capacity: 46Tbps. Um. Wow.
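The port counts hang together if you do the arithmetic. Here's a sketch that checks them; the way I arrive at the 46Tbps figure (every port on every switch at 20Gbps) is my own guess at how that number was computed, not something stated in the talk.

```python
# Sanity-check the fat tree port counts quoted in the talk.
LEAF_SWITCHES = 18
SPINE_SWITCHES = 6
PORTS_PER_SWITCH = 96
NODE_PORTS_PER_LEAF = 64
UPLINKS_PER_LEAF = 32
LINK_GBPS = 20
NODES = 1100

# Each leaf splits its 96 ports into 64 down (nodes) and 32 up (spines).
ports_used_per_leaf = NODE_PORTS_PER_LEAF + UPLINKS_PER_LEAF

# Enough node-facing ports for all 1100 nodes?
node_ports = LEAF_SWITCHES * NODE_PORTS_PER_LEAF

# 32 uplinks over 6 spines: 5 links to every spine, and 1 extra link
# to 2 of them, which is where "5-6 ports per leaf" comes from.
base_links, spines_with_extra = divmod(UPLINKS_PER_LEAF, SPINE_SWITCHES)

# All leaf uplinks have to land on spine ports.
spine_ports_needed = LEAF_SWITCHES * UPLINKS_PER_LEAF
spine_ports_available = SPINE_SWITCHES * PORTS_PER_SWITCH

# Down:up ratio per leaf; 2:1 is what makes this a "half" bisection design.
oversubscription = NODE_PORTS_PER_LEAF / UPLINKS_PER_LEAF

# Assumption: capacity counted as every switch port at 20Gbps.
total_switches = LEAF_SWITCHES + SPINE_SWITCHES
capacity_tbps = total_switches * PORTS_PER_SWITCH * LINK_GBPS / 1000

print(ports_used_per_leaf, node_ports, base_links, spines_with_extra)
print(spine_ports_needed, spine_ports_available, oversubscription)
print(f"{capacity_tbps:.2f} Tbps")
```

Under that assumption the spine ports come out exactly full and the capacity lands at about 46Tbps, so the quoted numbers are at least self-consistent.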

Why a half CBB (constant bisection bandwidth) design? It gives 625MBps of duplex bandwidth when half the nodes simultaneously communicate with the other half. The full duplex bandwidth is the theoretical limit of the PCI-X bus.

They designed for the bus being the limit. Scientific applications are not perfectly synchronous, and hence rarely encounter any bandwidth limitations from the half CBB design.

This is all above my head right now :D

Gigabit Ethernet management backplane: carries NFS, control job startup and typical IP traffic. It's based on five Cisco 4500 enterprise series switches, with 240 Gigabit Ethernet ports per switch. Managed fabric with integrated IP traffic.

Facilities
How do you house this beast?

9000 sqft data center, raised floor, environmental controls, dual backbone, dual feeds and generators, fire suppression.

They took 4000sqft for this. There's a 24/7 NOC right there.

3MW of power. Dual redundant with backup UPS. 2+ million BTUs of cooling capacity using Liebert's extreme density cooling. This system uses rack-mounted heat exchangers with R-134a refrigerant and an overhead chiller.

Front to back cooling. Traditional AC would have resulted in a wind velocity of over 60 mph under the raised floor. They'd have had a great wind tunnel. So, they used 270 of cold water (40 degrees) pumped in through huge pipes to chill the refrigerant, which then runs through copper pipes and through the hot aisles.

They've built a giant fridge. It stays at 72 degrees. If it fails, it jumps to 100 degrees in 2 minutes. 2 minutes after that, crispy G5s.

30 machines a day for a while.

2.5GHz signalling in the Infiniband cable (it looks like wide monitor cabling). It's all copper cable. 20Gbit/sec off copper. Unreal.

They had to rebuild the area around the building.

Bring in a machine, power it up, then open it up, put in the infiniband card, then the RAM, then power it up. Then rack it. 2 hours total per machine.

Software

Runs Mac OS X 10.2.7, for which Mellanox wrote Infiniband drivers. They use MPI parallel communication libraries. C and C++ optimizing compilers: IBM xlc and gcc 3.3.

Fortran 95/90/77 compilers: IBM xlf and NAGWare.

They rewrote a kext for cache optimized memory management. Ported MVAPICH to OS X, added message cache and dynamic memory management systems to improve performance. Scalable job startup system for MVAPICH.

Reliability

Supercomputers based on commodity clusters face reliability concerns due to the sheer number of components. They developed a transparent fault tolerance system, called déjà vu, for engineering reliability into large-scale supercomputers. VT is leading the collaboration with PSC and ISR. Déjà vu is being ported to the G5 platform, and will be deployed at the TCF, funded by the NSF. They're currently working on a patent application.

The app should recover from any failure, because the system does it, transparent to the program. Apps shouldn't have to worry about this. The system does.

Salient features:
Transparent checkpoint, recovery and migration system; it's kernel independent
New Model to achieve global state consistency
Incremental checkpointing
Non-Blocking checkpointing
Integrates user-initiated and system initiated checkpointing
Supports process migration
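
To make "incremental checkpointing" concrete, here's a toy sketch of the general idea: save only the pieces of state that changed since the last checkpoint, then rebuild by replaying the saved pieces in order. This is not Déjà vu's design, which the talk didn't detail; the class, the "pages" dictionary and the hashing scheme are all my own illustration.

```python
import hashlib


class IncrementalCheckpointer:
    """Toy incremental checkpointing over a dict of page id -> bytes.

    Hypothetical illustration only; not how Déjà vu is implemented.
    """

    def __init__(self):
        self._last_digest = {}  # page id -> digest at the last checkpoint
        self.checkpoints = []   # list of deltas, oldest first

    def checkpoint(self, pages):
        """Store only the pages that changed; return how many were stored."""
        delta = {}
        for page_id, data in pages.items():
            digest = hashlib.sha256(data).hexdigest()
            if self._last_digest.get(page_id) != digest:
                delta[page_id] = bytes(data)
                self._last_digest[page_id] = digest
        self.checkpoints.append(delta)
        return len(delta)

    def recover(self):
        """Rebuild the latest state by replaying deltas in order."""
        state = {}
        for delta in self.checkpoints:
            state.update(delta)
        return state


pages = {0: b"alpha", 1: b"beta", 2: b"gamma"}
ckpt = IncrementalCheckpointer()

first = ckpt.checkpoint(pages)   # everything is new, so all pages are stored
pages[1] = b"BETA"               # the "application" modifies one page
second = ckpt.checkpoint(pages)  # only the modified page is stored
restored = ckpt.recover()

print(first, second, restored == pages)
```

The toy ignores deleted pages and consistency across many processes, which is presumably where the hard part (the "global state consistency" model) comes in.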

Communications
The first version of the Mellanox driver and Verbs API was delivered in mid-August. Infiniband achieved 800MBps, with MPI performance of 700MBps (MPI latency 8-14µs). Changes to PCI-X timing have increased Infiniband performance to 870MBps over the Verbs API.
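
Latency and bandwidth interact: small messages are dominated by the latency, and only large ones get close to the quoted bandwidth. A rough sketch using the common "latency plus size over bandwidth" model, which is my assumption and not something presented in the talk, with the 8µs and 700MBps MPI figures as defaults:

```python
# Simple linear cost model for one message: time = latency + size / bandwidth.
# Assumption: this model and the function names are illustrative only.

def transfer_time_us(message_bytes, latency_us=8.0, bandwidth_mb_per_s=700.0):
    """Estimated time in microseconds (1 MB/s is 1 byte per microsecond)."""
    return latency_us + message_bytes / bandwidth_mb_per_s


def effective_bandwidth_mb_per_s(message_bytes, latency_us=8.0,
                                 bandwidth_mb_per_s=700.0):
    """Bandwidth actually seen by a message of the given size."""
    return message_bytes / transfer_time_us(
        message_bytes, latency_us, bandwidth_mb_per_s)


# Message size at which half the peak bandwidth is reached:
# latency * bandwidth = 8 us * 700 bytes/us.
half_bandwidth_bytes = 8.0 * 700.0

for size in (1024, int(half_bandwidth_bytes), 1024 * 1024):
    print(size, round(effective_bandwidth_mb_per_s(size), 1), "MB/s")
```

Under this model a message needs to be several kilobytes before it sees even half of the 700MBps, which is why the latency number matters as much as the bandwidth number.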

There is translation between PCI-X and main memory; you go through an engine first. The card is full 64bit, but not from the PCI bus in the Mac.

The LINPACK benchmark

It solves a very large system of linear equations. Dense matrix operations.

500,000 variables.
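
For a feel of what the benchmark does, here's a tiny pure-Python version of the same kind of problem: solving a dense system by Gaussian elimination with partial pivoting. It's a sketch for a 3x3 system, not the actual benchmark code, and the flop estimate uses only the dominant (2/3)n^3 term with lower-order terms ignored.

```python
# Toy dense solve, the kind of problem LINPACK times at huge scale.
# Assumption: illustrative code only; the real benchmark uses tuned BLAS.

def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x


def dominant_flops(n):
    """Leading-order operation count for an n x n dense solve."""
    return 2.0 * n ** 3 / 3.0


a = [[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]]
b = [8.0, -11.0, -3.0]
x = solve_dense(a, b)                 # close to [2, 3, -1]
flops_at_scale = dominant_flops(500_000)

print([round(v, 6) for v in x])
print(f"{flops_at_scale:.2e} flops for n = 500,000")
```

At n = 500,000 the dominant term alone is on the order of 8e16 operations, which gives a sense of why the run needs a machine this size.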

They used the BLAS libraries; the core routine has a GEMM efficiency of 84.1% (fairly phenomenal). Their benchmark used a mix of Goto's libs and Apple's vecLib framework. IBM is nowhere near this good. Goto has the fastest library on pretty much every processor.

Currently they're at 9.555Teraflops. They want another 10% boost pretty quick, which would cross the 10Teraflop line and make this the first academic machine to do so. That makes them #3. Worldwide. Period.
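
Putting the measured number next to the theoretical peak from earlier gives the overall efficiency, and shows how far the 10Teraflop line actually is. A quick sketch using only figures from these notes (variable names are mine):

```python
# LINPACK result versus theoretical peak, using figures from the talk.
RMAX_TFLOPS = 9.555   # measured LINPACK result so far
RPEAK_TFLOPS = 17.6   # theoretical peak quoted earlier

efficiency = RMAX_TFLOPS / RPEAK_TFLOPS
boost_needed_for_10 = 10.0 / RMAX_TFLOPS - 1.0
after_ten_percent = RMAX_TFLOPS * 1.10

print(f"Efficiency:          {efficiency:.1%}")
print(f"Boost to reach 10TF: {boost_needed_for_10:.1%}")
print(f"After a 10% boost:   {after_ten_percent:.2f} TFlops")
```

So the machine is running at a bit over half of peak, and the hoped-for 10% improvement would clear 10Teraflops with room to spare.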

Q&A:

How much time did it take to develop the custom software?

2 months, 18 hours a day, almost all by himself.

When you first brought it up, did you stagger it?

Yep, otherwise they'd spike the power.

How are you handling data coming off the machine?

There's scratch space on each node, and they have external storage, plugged into the Infiniband mesh.

How much disk space?

We're not sure yet. 40-50 TB eventually.

How do you health check the nodes?

A daemon on each node. It does fault tolerance and such at startup, and it's built into job scheduling.

Examples of stuff that's run on the TCF?

Nanoscale Electronics
Quantum Chemistry
Computational Chemistry/Biochem
Aerodynamics
Cell Cycle Modeling
Molecular Statics
Computational Acoustics
and tons more I couldn't capture fast enough.

Would you attribute your success to the single source of code? Do you see the networking stuff impacting the private sector?

It wasn't just me. The coding was him, but there's a huge long list of Apple folks who helped out, as well as Mellanox, Liebert and Cisco, plus Goto from the JPO, Dr. Panda and Andy Petit (OSU and UTK).

Major thanks to VT and everyone else.

They have people asking for clones, and expect many, many G5 clusters in the not too distant future.

What's the status of the code behind this?

Is this going to be open source? Most of it will go back to OSU and their license style. The memory manager will go open source. Mellanox hasn't said, but most of their other stuff is open source.

What's the cost on Infiniband?

All the switches and cards $1.6 mil. $176k for the cables.

Why did you use the G5 instead of the Opteron or Itanium?

Both are fairly nice, but they're expensive. First, they didn't pass the price/performance ratio test. Opteron doesn't do what the G5 does: 4Gflops at peak, and the G5 is twice that. The Itanium is phenomenally efficient, but only at 1.5GHz, not 2GHz. The #4 machine is an 8.6Teraflop Itanium II cluster (on 2000 procs).

You built it all to 10.2.7, are you planning on upgrading to Panther?

They're upgrading to Panther in the next few weeks. The driver runs, the memory manager runs, everything else, no problem.

There's a lot of interest in departmental clusters, Is there documentation anywhere?

We hope to put up a full fledged package to duplicate this from 64 nodes and above. They hope to see many after this one.

How do you deal with Error Correction in Memory?

There's a lot of traffic on Ars Technica and other places about this. We do failure recovery, but the memory doesn't report errors. One of the things we've noticed is that failures aren't an issue yet. The reason they can be confident is the LINPACK test, which is showing 16 digits of accuracy. We are planning on moving to ECC systems in the future. They may have to run things twice for a bit.

How much coke and how many pizzas?

500-600 pizzas.

Comments:

Great info in the article! How come it didn't get the deserved attention?
I linked it today from my website.

Posted by Arne Kuilman on December 20, 2003 — 12:51 PM


Sheer success of determination!

Posted by M V Chilukuri on May 15, 2004 — 4:07 AM

