iBet uBet web content aggregator. Adding the entire web to your favor.



Link to original content: http://www.top500.org/resources/frequently-asked-questions/
Frequently Asked Questions | TOP500

Frequently Asked Questions

This section contains frequently asked questions about the TOP500 project and list. It is still in a very early stage, and more questions with answers will be added shortly. If you have any suggestions, please let us know.

General

What is the TOP500?

The TOP500 list ranks the 500 fastest computer systems in use today. The collection was started in June 1993 and has been updated every six months since then. The report lists the sites that have the 500 most powerful computer systems installed. The best Linpack benchmark performance achieved is used as the performance measure for ranking the computers.

The Linpack Benchmark

What is the Linpack Benchmark?

The Linpack Benchmark is a measure of a computer’s floating-point rate of execution. It is determined by running a computer program that solves a dense system of linear equations. Over the years the characteristics of the benchmark have changed a bit. In fact, there are three benchmarks included in the Linpack Benchmark report.

The Linpack Benchmark grew out of the Linpack software project. It was originally intended to give users of the package a feeling for how long it would take to solve certain matrix problems. The benchmark started as an appendix to the Linpack Users' Guide and has grown since the guide was published in 1979.

What is the Linpack Benchmark report?

The Linpack Benchmark report is entitled “Performance of Various Computers Using Standard Linear Equations Software”. The report lists the performance in Mflop/s of a number of computer systems. A copy of the report is available at http://www.netlib.org/benchmark/performance.ps.

What is the reference for the Linpack Benchmark Report?

The Linpack Benchmark report should be referenced in the following way:

“Performance of Various Computers Using Standard Linear Equations Software”, Jack Dongarra, University of Tennessee, Knoxville TN, 37996, Computer Science Technical Report Number CS - 89 – 85, today’s date, url:http://www.netlib.org/benchmark/performance.ps.

Is there a paper which describes the benchmark in some detail and gives a historical perspective?

The paper “The LINPACK Benchmark: Past, Present, and Future” by Jack Dongarra, Piotr Luszczek, and Antoine Petitet provides a look at the details of the benchmark and provides performance data in graphics form for a number of machines on basic operations. A copy of the paper is available at http://www.netlib.org/utk/people/JackDongarra/PAPERS/hpl.pdf.

What is a Mflop/s?

Mflop/s is a rate of execution: millions of floating point operations per second. Whenever this term is used it refers to 64-bit floating point operations, and the operations are either additions or multiplications. Gflop/s refers to billions of floating point operations per second and Tflop/s refers to trillions of floating point operations per second.

What is the theoretical peak performance?

The theoretical peak is based not on an actual performance from a benchmark run, but on a paper computation to determine the theoretical peak rate of execution of floating point operations for the machine. This is the number manufacturers often cite; it represents an upper bound on performance. That is, the manufacturer guarantees that programs will not exceed this rate, sort of a "speed of light" for a given computer. The theoretical peak performance is determined by counting the number of floating-point additions and multiplications (in full precision) that can be completed during a period of time, usually the cycle time of the machine. For example, an Intel Itanium 2 at 1.5 GHz can complete 4 floating point operations per cycle, which gives a theoretical peak performance of 6 Gflop/s.
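The paper computation described above can be sketched in a few lines of Python. This is only an illustration; the function name and the processors parameter are assumptions introduced here, not part of any benchmark software.

```python
def theoretical_peak_gflops(clock_ghz, flops_per_cycle, processors=1):
    """Theoretical peak in Gflop/s: clock rate (GHz) times floating point
    operations completed per cycle, times the number of processors."""
    return clock_ghz * flops_per_cycle * processors


# The Itanium 2 example from the text: 1.5 GHz, 4 operations per cycle.
print(theoretical_peak_gflops(1.5, 4))  # 6.0
```

A measured Linpack result divided by this number gives the fraction of peak that a run actually achieved.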

What are the three benchmarks in the Linpack Benchmark report?

The three benchmarks in the Linpack Benchmark report are the Linpack Fortran n = 100 benchmark (see Table 1 of the report), the Linpack n = 1000 benchmark (see Table 1 of the report), and Linpack’s Highly Parallel Computing benchmark (see Table 3 of the report).

What is the Linpack Fortran n = 100 benchmark?

The first benchmark is for a matrix of order 100 using the Linpack software in Fortran. The results can be found in Table 1 of the benchmark report. To run this benchmark, download the file from http://www.netlib.org/benchmark/Linpackd; this is a Fortran program. To run the program you will need to supply a timing function called SECOND, which should report the CPU time that has elapsed. The ground rules for running this benchmark are that you can make no changes to the Fortran code, not even to the comments. Only compiler optimization can be used to enhance performance.

What exactly does the Linpack Fortran n=100 benchmark time?

The Linpack benchmark measures the performance of two routines from the Linpack collection of software. These routines are DGEFA and DGESL (these are double-precision versions; SGEFA and SGESL are their single-precision counterparts). DGEFA performs the LU decomposition with partial pivoting, and DGESL uses that decomposition to solve the given system of linear equations.

Most of the time is spent in DGEFA. Once the matrix has been decomposed, DGESL is used to find the solution; this process requires O(n^2) floating-point operations, as opposed to the O(n^3) floating-point operations of DGEFA. The results for this benchmark can be found in Table 1, second column, under “LINPACK Benchmark n = 100” of the Linpack Benchmark Report.
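To make the two steps concrete, here is a minimal pure-Python sketch of a factor step and a solve step. It is not the Linpack Fortran code: the function names lu_factor and lu_solve are made up for illustration, rows are swapped whole (as LAPACK does) rather than in the Linpack style, and the test matrix comes from Python's random module rather than the benchmark's own generator. The right-hand side is built from the row sums so that the exact solution is all ones.

```python
import random


def lu_factor(a):
    """Factor the n x n matrix a (list of rows) in place by Gaussian
    elimination with partial pivoting; this is the O(n^3) step.
    Returns the list of pivot row indices."""
    n = len(a)
    pivots = []
    for k in range(n - 1):
        # Partial pivoting: choose the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        pivots.append(p)
        if a[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]          # store the multiplier in place
            m = a[i][k]
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
    return pivots


def lu_solve(a, pivots, b):
    """Solve A x = b using the factors from lu_factor; this is the
    O(n^2) step."""
    n = len(a)
    x = list(b)
    # Apply the row interchanges recorded during factorization.
    for k, p in enumerate(pivots):
        x[k], x[p] = x[p], x[k]
    # Forward elimination with the stored multipliers.
    for k in range(n - 1):
        for i in range(k + 1, n):
            x[i] -= a[i][k] * x[k]
    # Back substitution with the upper triangle.
    for k in range(n - 1, -1, -1):
        x[k] /= a[k][k]
        for i in range(k):
            x[i] -= a[i][k] * x[k]
    return x


random.seed(1)
n = 50
a = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
b = [sum(row) for row in a]             # exact solution is all ones
x = lu_solve(a, lu_factor(a), b)
print(max(abs(v - 1.0) for v in x))     # largest deviation from 1
```

The triple loop in lu_factor is where most of the time goes, which matches the statement above that the factorization dominates the solve.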

What is the Linpack n = 1000 benchmark (TPP, Best Effort)?

The second benchmark is for a matrix of size 1000 and can be found in Table 1 of the benchmark report. To run this benchmark, download the file from http://www.netlib.org/benchmark/1000d; this is a Fortran driver. The ground rules for running this benchmark are a bit more relaxed in that you can specify any linear equation solver you wish, implemented in any language. A requirement is that your method must compute a solution and the solution must return a result to the prescribed accuracy. TPP stands for Toward Peak Performance; this is the title of the column in the benchmark report that lists the results.

Why are my performance results below the theoretical peak?

The performance of a computer is a complicated issue, a function of many interrelated quantities. These quantities include the application, the algorithm, the size of the problem, the high-level language, the implementation, the human level of effort used to optimize the program, the compiler's ability to optimize, the age of the compiler, the operating system, the architecture of the computer, and the hardware characteristics. The results presented for this benchmark suite should not be extolled as measures of total system performance (unless enough analysis has been performed to indicate a reliable correlation of the benchmarks to the workload of interest) but, rather, as reference points for further evaluations.

Why are the performance results for my computer different than the same machine’s results in the Linpack Report?

There are many reasons why your results may vary from results recorded in the Linpack Benchmark Report. Issues such as load on the system, accuracy of the clock, compiler options, version of the compiler, size of cache, bandwidth from memory, amount of memory, etc. can affect the performance even when the processors are the same.

What is the Linpack’s “Highly Parallel Computing” benchmark?

The third benchmark is called the Highly Parallel Computing Benchmark and can be found in Table 3 of the Benchmark Report. (This is the benchmark used for the Top500 report.) This benchmark attempts to measure the best performance of a machine in solving a system of equations. The problem size and software can be chosen to produce the best performance.

http://www.netlib.org/benchmark/hpl/ 

What are the ground rules for the first benchmark?

The “ground rules” for running the first benchmark in the report, the n=100 case, are that the program is run as is, with no changes to the source code; not even changes to the comments are allowed. The compiler may perform optimization at compile time through compiler switches. The user must supply a timing function called SECOND. SECOND returns the running CPU time for the process. The matrix generated by the benchmark program must be used to run this case.

What are the ground rules for the second benchmark?

The “ground rules” for running the second benchmark in the report, the n=1000 case, allow for a complete user replacement of the LU factorization and solver steps. The calling sequence should be the same as the original routines. The problem size should be of order 1000. The accuracy of the solution must satisfy the following bound:

    ||Ax - b|| / (||A|| ||x|| n eps) <= O(1)

where eps is the machine precision (on IEEE machines this is 2^-53) and n is the size of the problem. The matrix used must be the same matrix used in the driver program available from netlib.

What are the ground rules for the third benchmark?

The “ground rules” for running the third benchmark in the report, the Highly Parallel case, allow for a complete user replacement of the LU factorization and solver steps. The accuracy of the solution must satisfy the following bound:

    ||Ax - b|| / (||A|| ||x|| n eps) <= O(1)

where eps is the machine precision (on IEEE machines this is 2^-53) and n is the size of the problem. The matrix used must be the same matrix used in the driver program available from netlib. There is no restriction on the problem size.

To what accuracy must the solution conform?

The solution to all three benchmarks must satisfy the following mathematical formula:

    ||Ax - b|| / (||A|| ||x|| n eps) <= O(1)

where eps is the machine precision (on IEEE machines this is 2^-53) and n is the size of the problem. This implies the computation must be done in 64-bit floating point arithmetic.
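As a rough illustration of such an accuracy check, the sketch below computes a normalized residual of the form ||Ax - b|| / (||A|| ||x|| n eps). The choice of the infinity norm, the function name, and the default eps of 2^-53 are assumptions made for this example; the benchmark programs may scale the residual differently.

```python
def normalized_residual(a, x, b, eps=2.0 ** -53):
    """Return ||Ax - b|| / (||A|| * ||x|| * n * eps) using infinity norms.

    a is a list of rows, x the computed solution, b the right-hand side.
    """
    n = len(a)
    # Largest component of the residual vector Ax - b.
    resid = max(
        abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n)
    )
    norm_a = max(sum(abs(v) for v in row) for row in a)  # max row sum
    norm_x = max(abs(v) for v in x)
    return resid / (norm_a * norm_x * n * eps)


a = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 4.0]
print(normalized_residual(a, [1.0, 1.0], b))         # exact solution
print(normalized_residual(a, [1.0, 1.0 + 1e-6], b))  # perturbed solution
```

A value of order 1 indicates a solution that is accurate to working precision; the perturbed solution in the example gives a value many orders of magnitude larger.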

What numerical precision is required to run the benchmark and gain an entry in the Linpack Benchmark report?

In order to have an entry included in the Linpack Benchmark report the results must be computed using full precision. By full precision we generally mean 64 bit floating point arithmetic or higher. Note that this is not an issue of single or double precision as some systems have 64-bit floating point arithmetic as single precision. It is a function of the arithmetic used.

Can I get a more personalized list of machine and performance results?

You can get a more personalized listing of machines by using the interface at http://performance.netlib.org/performance/html/PDSbrowse.html

This list is not kept current however and may lag the Linpack benchmark report by months.

How can I get the Linpack Benchmark program?

You can download the programs used to generate the Linpack benchmark results from http://www.netlib.org/benchmark/linpackd. This is a Fortran program. There is a C version of the benchmark located at: http://www.netlib.org/benchmark/linpackc. There is a Java version of the benchmark that can be downloaded as an applet at:

There is a Java program at:

http://www.netlib.org/benchmark/linpackjava/

Is there a Java version of the Linpack Benchmark?

There is a Java version of the benchmark that can be downloaded as an applet at:

There is a Java program at: http://www.netlib.org/benchmark/linpackjava/

What do I do to run the Linpack Benchmark Program?

For the 100x100 Fortran version, you need to supply a timing function called SECOND. SECOND is a timer function that will be called from Fortran and is expected to return the running CPU time in seconds. In the program two calls to SECOND are made and the difference is taken to obtain the time.

How does the Linpack Benchmark performance relate to my application?

The performance of the Linpack benchmark is typical for applications where the basic operation is based on vector primitives such as adding a scalar multiple of a vector to another vector. Many applications exhibit the same performance as the Linpack Benchmark. However, results should not be taken too seriously. In order to measure the performance of any computer it’s critical to probe for the performance of your applications. The Linpack Benchmark can only give one point of reference. In addition, in multiprogramming environments it is often difficult to reliably measure the execution time of a single program. We trust that anyone actually evaluating machines and operating systems will gather more reliable and more representative data.

Are there errors in the Linpack Benchmark report?

While we make every attempt to verify the results obtained from users and vendors, errors are bound to exist and should be brought to our attention. We encourage users to obtain the programs and run the routines on their machines, reporting any discrepancies with the numbers listed here.

What is Linpack?

The Linpack package is a collection of Fortran subroutines for solving various systems of linear equations. (http://www.netlib.org/Linpack/) The software in Linpack is based on a decompositional approach to numerical linear algebra. The general idea is the following. Given a problem involving a matrix, one factors or decomposes the matrix into a product of simple, well-structured matrices which can be easily manipulated to solve the original problem. The package has the capability of handling many different matrix types and different data types, and provides a range of options. Linpack itself is built on another package called the BLAS. Linpack was designed in the late 70's and has been superseded by a package called LAPACK. 

How can I get the complete Linpack software collection?

The Linpack software library is available from netlib. See http://www.netlib.org/Linpack/

What are the BLAS?

The BLAS (Basic Linear Algebra Subprograms) are high quality "building block" routines for performing basic vector and matrix operations. Level 1 BLAS do vector-vector operations, Level 2 BLAS do matrix-vector operations, and Level 3 BLAS do matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they're commonly used in the development of high quality linear algebra software, LINPACK and LAPACK for example. For additional information see: http://www.netlib.org/blas/

Where can I get an optimized version of the BLAS?

The ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing research effort focusing on applying empirical techniques in order to provide portable performance for the BLAS routines. At present, it provides C and Fortran77 interfaces to a portably efficient BLAS implementation, as well as a few routines from LAPACK. For additional information see: http://www.netlib.org/atlas/

Is Linpack the most efficient way to solve systems of equations?

Linpack is not the most efficient software for solving matrix problems. This is mainly due to the way the algorithm and resulting software access memory. The memory access patterns of the algorithm disregard the multi-layered memory hierarchies of RISC architectures and vector computers, so too much time is spent moving data instead of doing useful floating-point operations. LAPACK addresses this problem by reorganizing the algorithms to use block matrix operations, such as matrix multiplication, in the innermost loops. For each computer architecture, block operations can be optimized to account for memory hierarchies, providing a transportable way to achieve high efficiency on diverse modern machines. We use the term “transportable” instead of “portable” because, for fastest possible performance, LAPACK requires that highly optimized block matrix operations be already implemented on each machine. These operations are performed by the Level 3 BLAS in most cases.

What is LAPACK?

LAPACK is a software collection for solving various matrix problems in linear algebra, in particular systems of linear equations, least squares problems, eigenvalue problems, and singular value decompositions. The software is based on the use of block partitioned matrix techniques that aid in achieving high performance on RISC-based systems, vector computers, and shared memory parallel processors.

How can I get the whole LAPACK software collection?

LAPACK can be obtained from netlib, see (http://www.netlib.org/lapack/)

What is the history behind the Linpack Benchmark?

The Linpack Benchmark is, in some sense, an accident. It was originally designed to assist users of the Linpack package by providing information on execution times required to solve a system of linear equations. The first "Linpack Benchmark" report appeared as an appendix in the Linpack Users' Guide in 1979. The appendix comprised data for one commonly used path in Linpack for a matrix problem of size 100, on a collection of widely used computers (23 in all), so users could estimate the time required to solve their matrix problem.

Over the years other data was added, more as a hobby than anything else, and today the collection includes hundreds of different computer systems.

How can I add my computer's result to the table?

You can contact Jack Dongarra and send him the output from the benchmark program. When sending results please include the specific information on the computer on which the test was run, the compiler, the optimization that was used, and the site it was run on. You can contact Dongarra by sending email to dongarra@cs.utk.edu.

What is the SECOND function?

In order to run the benchmark program you will have to supply a function to gather the execution time on your computer. The execution time is requested by a call to the Fortran function SECOND. It is expected that the routine returns the accumulated execution time of your program. Two calls to SECOND are made and the difference is taken to compute the execution time.
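The timing pattern described above (two calls to a timer, then a difference) can be illustrated in Python. This is only a sketch of the idea: the benchmark itself expects a Fortran function named SECOND, and the names second and time_call below are stand-ins chosen for this example. The standard library call time.process_time() returns the CPU time consumed by the current process.

```python
import time


def second():
    """Stand-in for SECOND: accumulated CPU time of this process, in seconds."""
    return time.process_time()


def time_call(work, *args):
    """Call the timer before and after the work and return the difference
    together with the result of the work."""
    t0 = second()
    result = work(*args)
    t1 = second()
    return t1 - t0, result


elapsed, total = time_call(sum, range(1_000_000))
print(elapsed, total)
```

If the difference comes out as zero, the timer is too coarse for the amount of work being measured; repeating the work several times inside the timed region, or using a finer-grained timer, is the usual remedy.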

How can I measure the execution time more accurately and reliably?

The Performance API (PAPI) project specifies a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count Events, occurrences of specific signals related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture.

For additional information see: http://icl.cs.utk.edu/projects/papi/

Should I run the single and double precision of the benchmarks?

The results reported in the benchmark report reflect performance for 64-bit floating point arithmetic. On some machines, such as computers that have IEEE floating point arithmetic, this may be DOUBLE PRECISION; on other computers, such as Cray’s vector computers, this may be single precision (declared REAL in Fortran).

When and how often are the results updated in the benchmark report?

The benchmark report is updated continuously as new results arrive. They are posted to the web as they are updated. 

What matrix is used to run the benchmark?

The matrices are generated using a pseudo-random number generator. The matrices are designed to force partial pivoting to be performed in Gaussian Elimination.

What is HPL?

HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark. 

For HPL What problem size N should I run ?

In order to find out the best performance of your system, the largest problem size fitting in memory is what you should aim for. The amount of memory used by HPL is essentially the size of the coefficient matrix. So for example, if you have 4 nodes with 256 MB of memory on each, this corresponds to 1 GB total, i.e., 2^27 (about 134 million) double precision (8-byte) elements. The square root of that number is 11585. One definitely needs to leave some memory for the OS as well as for other things, so a problem size of 10000 is likely to fit. As a rule of thumb, 80% of the total amount of memory is a good guess. If the problem size you pick is too large, swapping will occur, and the performance will drop. If multiple processes are spawned on each node (say you have 2 processors per node), what counts is the amount of memory available to each process.
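The arithmetic in this answer can be written as a small helper. The sketch below is an illustration only: the function name, the optional rounding down to a multiple of the block size NB, and the use of binary megabytes (2^20 bytes) are assumptions made here, not part of HPL.

```python
import math


def hpl_problem_size(nodes, bytes_per_node, fraction=0.8, nb=None):
    """Estimate an HPL problem size N.

    The coefficient matrix holds N * N double precision (8-byte) elements,
    so N is about sqrt(fraction * total_bytes / 8). If nb is given, N is
    rounded down to a multiple of that block size.
    """
    total_elements = nodes * bytes_per_node // 8
    n = math.isqrt(int(fraction * total_elements))
    if nb:
        n -= n % nb
    return n


mb = 2 ** 20
print(hpl_problem_size(4, 256 * mb, fraction=1.0))  # 11585, all of memory
print(hpl_problem_size(4, 256 * mb))                # 10362, the 80% rule
print(hpl_problem_size(4, 256 * mb, nb=128))        # 10240
```

The second value is close to the problem size of 10000 suggested in the text for this configuration.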

For HPL what block size NB should I use ?

HPL uses the block size NB for the data distribution as well as for the computational granularity. From a data distribution point of view, the smaller NB, the better the load balance. You definitely want to stay away from very large values of NB. From a computation point of view, too small a value of NB may limit the computational performance by a large factor because almost no data reuse will occur in the highest level of the memory hierarchy. The number of messages will also increase. Efficient matrix-multiply routines are often internally blocked. Small multiples of this blocking factor are likely to be good block sizes for HPL. The bottom line is that "good" block sizes are almost always in the [32 .. 256] interval. The best values depend on the computation / communication performance ratio of your system. To a much lesser extent, the problem size matters as well. Say, for example, you empirically found that 44 was a good block size with respect to performance. 88 or 132 are likely to give slightly better results for large problem sizes because of a slightly higher flop rate.

For HPL what process grid ratio P x Q should I use ?

This depends on the physical interconnection network you have. Assuming a mesh or a switch HPL "likes" a 1:k ratio with k in [1..3]. In other words, P and Q should be approximately equal, with Q slightly larger than P. Examples: 2 x 2, 2 x 4, 2 x 5, 3 x 4, 4 x 4, 4 x 6, 5 x 6, 4 x 8 ... If you are running on a simple Ethernet network, there is only one wire through which all the messages are exchanged. On such a network, the performance and scalability of HPL is strongly limited and very flat process grids are likely to be the best choices: 1 x 4, 1 x 8, 2 x 4 ...
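One simple way to pick a nearly square grid for a given number of processes is to take the factor pair with the largest P not exceeding Q. The helper below is a hypothetical sketch of that idea, not something HPL provides; it reproduces the mesh or switch examples listed above.

```python
import math


def process_grid(nprocs):
    """Return (P, Q) with P * Q == nprocs, P <= Q, and P as large as possible."""
    p = math.isqrt(nprocs)
    # Step down until P divides the process count; P = 1 always does.
    while nprocs % p:
        p -= 1
    return p, nprocs // p


for nprocs in (4, 8, 10, 12, 16, 24, 30, 32):
    print(nprocs, process_grid(nprocs))
```

For process counts with no suitable factors (a prime count, for instance) this yields a flat 1 x Q grid, and on a simple Ethernet network a flatter grid than the one returned here may be the better choice, as noted above.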

For HPL what about the one processor case ?

HPL has been designed to perform well for large problem sizes on hundreds of nodes and more. The software works on one node and for large problem sizes, one can usually achieve pretty good performance on a single processor as well. For small problem sizes however, the overhead due to message-passing, local indexing and so on can be significant.

For HPL why so many options in HPL.dat ?

There are quite a few reasons. First off, these options are useful to determine what matters and what does not on your system. Second, HPL is often used in the context of early evaluation of new systems. In such a case, everything is usually not quite working right, and it is convenient to be able to vary these parameters without recompiling. Finally, every system has its own peculiarities and one is likely to be willing to empirically determine the best set of parameters. In any case, one can always follow the advice provided in the tuning section of the HPL document and not worry about the complexity of the input file.

Can HPL be Outperformed ?

Certainly. There is always room for performance improvements. Specific knowledge about a particular system is always a source of performance gains. Even from a generic point of view, better algorithms or more efficient formulation of the classic ones are potential winners.

Can I use Strassen’s Method when doing the matrix multiplies in the HPL benchmark or for the Top500 run?

The normal matrix multiplication algorithm requires n^3 + O(n^2) multiplications and about the same number of additions. Strassen's algorithm reduces the total number of operations to about O(n^2.81) (the exponent is log2 7) by recursively multiplying 2n × 2n matrices using seven n × n matrix multiplications. Thus using Strassen’s Algorithm will distort the true execution rate. As a result we do not allow Strassen’s Algorithm to be used for the TOP500 reporting. As a side note, in the "usual" matrix multiplication, we have an n^2 error term. In Strassen's method, the error exponent p in n^p ranges from 2 to 3.85, and the numerical error can be 10 to 100 times greater than that for standard multiplication.
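A small counting sketch shows the size of the effect. It assumes n is a power of two, that the recursion is carried all the way down to 1 x 1 blocks, and it counts only scalar multiplications; the function names are made up for this illustration.

```python
import math


def standard_multiplications(n):
    """Scalar multiplications in the usual n x n matrix product."""
    return n ** 3


def strassen_multiplications(n):
    """Scalar multiplications in Strassen's method for n a power of two,
    recursing down to 1 x 1 blocks: seven half-size products per level."""
    if n == 1:
        return 1
    return 7 * strassen_multiplications(n // 2)


n = 1024
print(standard_multiplications(n))   # 1073741824
print(strassen_multiplications(n))   # 282475249
print(math.log2(7))                  # exponent of Strassen's method
```

Because a reported rate divides a nominal operation count by the measured time, a run that actually performs fewer operations would report a rate higher than the hardware really sustained.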

Where can I get the software to generate performance results for the Top500?

Optimized software is available that many people use to generate the Top500 performance results. This benchmark attempts to measure the best performance of a machine in solving a system of equations. The problem size and software can be chosen to produce the best performance. A copy of that software can be downloaded from:

http://www.netlib.org/benchmark/hpl/

In order to run this you will need MPI and an optimized version of the BLAS. For MPI you can see:  http://www-unix.mcs.anl.gov/mpi/mpich/download.html and for the BLAS see: http://www.netlib.org/atlas/ .

Why would a machine appear in the Linpack Benchmark report but not in the Top500 list?

There could be two reasons. First, the Linpack Benchmark report contains historic information. Even if a computer is no longer in existence it can appear in the Linpack Benchmark report. This is unlike the Top500, which reports the 500 fastest computers in existence at a given point in time. The second reason is that the Top500 list comes out twice a year, whereas the Linpack Benchmark report is updated continuously.

Why would a machine appear in the Top500 list and not in the Linpack Benchmark report?

If a machine is in the Top500 list it should appear in the Linpack Benchmark report. If you see an instance where this is not the case, it is probably a mistake; please send email to Jack Dongarra (dongarra@cs.utk.edu) about the situation.

How can I interpret the results from the Linpack 100x100 benchmark?

When the Linpack Fortran n = 100 benchmark is run it produces the following kind of results:

 

       Please send the results of this run to:

 Jack J. Dongarra
 Computer Science Department
 University of Tennessee
 Knoxville, Tennessee 37996-1300

 Fax: 865-974-8296

 Internet: dongarra@cs.utk.edu

     norm. resid      resid           machep         x(1)          x(n)
  1.67005097E+00  7.41628980E-14  2.22044605E-16  1.00000000E+00  1.00000000E+00

    times are reported for matrices of order   100
      dgefa      dgesl      total     mflops       unit      ratio
 times for array with leading dimension of 201
  1.540E-03  6.888E-05  1.609E-03  4.268E+02  4.686E-03  2.873E-02
  1.509E-03  7.084E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
  1.509E-03  7.003E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
  1.502E-03  6.593E-05  1.568E-03  4.380E+02  4.567E-03  2.800E-02

 times for array with leading dimension of 200
  1.431E-03  6.716E-05  1.498E-03  4.584E+02  4.363E-03  2.675E-02
  1.424E-03  6.694E-05  1.491E-03  4.605E+02  4.343E-03  2.663E-02
  1.431E-03  6.699E-05  1.498E-03  4.583E+02  4.364E-03  2.676E-02
  1.432E-03  6.439E-05  1.497E-03  4.588E+02  4.360E-03  2.673E-02

 

The norm. resid is a measure of the accuracy of the computation. The value should be O(1). If the value is much greater than O(100), it suggests that the results are not correct.

The resid is the unnormalized quantity.

The term machep measures the precision used to carry out the computation. On an IEEE floating point computer the value should be 2.22044605e-16.

The values of x(1) and x(n) are the first and last components of the solution. The problem is constructed so that the components of the solution should all be ones.

Two sets of timings are performed, both on matrices of size 100. In the first set the 2-dimensional array that contains the matrix has a leading dimension of 201; in the second set the leading dimension is 200. This is done to see what effect, if any, the placement of the arrays in memory has on the performance.

Times for dgefa and dgesl are reported. dgefa factors the matrix using Gaussian elimination with partial pivoting and dgesl solves a system based on the factorization. dgefa requires about 2/3 n^3 operations and dgesl requires on the order of n^2 operations. The value of total is the sum of the times, and mflops is the execution rate in millions of floating point operations per second. Here a floating point operation is taken to be a floating point addition or multiplication. Unit and ratio are obsolete and should be ignored.

If the time reported is negative or zero then the clock resolution is not accurate enough for the granularity of the work. In this case a different timing routine should be used that has better resolution. 
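The mflops column can be recomputed from the total time. The sketch below assumes a nominal operation count of 2/3 n^3 + 2 n^2 for the factor and solve steps together; that count is an assumption made here, chosen because it reproduces the mflops values in the sample output above. The function name is made up for this example.

```python
def linpack_mflops(n, total_seconds):
    """Execution rate in Mflop/s for an order-n solve that took total_seconds,
    assuming a nominal count of 2/3 n^3 + 2 n^2 floating point operations."""
    if total_seconds <= 0.0:
        # A zero or negative time means the clock resolution is too coarse.
        raise ValueError("timer resolution too coarse for this problem size")
    ops = 2.0 * n ** 3 / 3.0 + 2.0 * n ** 2
    return ops / total_seconds / 1.0e6


# First timing line of each set in the sample output (total column).
print(linpack_mflops(100, 1.609e-3))  # compare with 4.268E+02
print(linpack_mflops(100, 1.498e-3))  # compare with 4.584E+02
```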

Do you have an archive of previous Linpack Benchmark reports or results?

No archive is maintained of previous results. However, here is some information to provide a historical perspective. The numbers in the following tables have been extracted from old Linpack Benchmark Reports. It took a bit of "file archaeology" to put the list together since I don't have the complete set of reports.

Top Computers Over Time for the Linpack n=100 Benchmark

(Entries for this table began in 1979.)

 

Year  Computer                                    Processors  Cycle time  Mflop/s
2006  NEC SX-8/1 (1 proc)                         1           2 GHz       2177
2004  Intel Pentium Nocona (1 proc 3.6 GHz)       1           3.6 GHz     1803
2003  HP Integrity Server rx2600 (1 proc 1.5GHz)  1           1.5 GHz     1635
2002  Intel Pentium 4 (3.06 GHz)                  1           2.06 GHz    1414
2001  Fujitsu VPP5000/1                           1           3.33 nsec   1156
2000  Fujitsu VPP5000/1                           1           3.33 nsec   1156
1999  CRAY T916                                   4           2.2 nsec    1129
1995  CRAY T916                                   1           2.2 nsec    522
1994  CRAY C90                                    16          4.2 nsec    479
1993  CRAY C90                                    16          4.2 nsec    479
1992  CRAY C90                                    16          4.2 nsec    479
1991  CRAY C90                                    16          4.2 nsec    403
1990  CRAY Y-MP                                   8           6.0 nsec    275
1989  CRAY Y-MP                                   8           6.0 nsec    275
1988  CRAY Y-MP                                   1           6.0 nsec    74
1987  ETA 10-E                                    1           10.5 nsec   52
1986  NEC SX-2                                    1           6.0 nsec    46
1985  NEC SX-2                                    1           6.0 nsec    46
1984  CRAY X-MP                                   1           9.5 nsec    21
1983  CRAY 1                                      1           12.5 nsec   12
...
1979  CRAY 1                                      1           12.5 nsec   3.4

 

These numbers come from the Linpack Benchmark Report Table 1.

=====================================================================

 

Top Computers Over Time for the Linpack n=1000 Benchmark

(Entries for this table began in 1986.)

 

Year  Computer            Processors  Cycle time (nsec)  Measured Mflop/s  Peak Mflop/s
2006  NEC SX-8/8          8           2 GHz              75140             128000
2000  NEC SX-5/16         16          4.0                45030             64000
1995  CRAY T916           16          2.2                19400             28800
1994  Hitachi S-3800/480  4           2                  16170             32000
1993  NEC SX-3/44R        4           2.5                15120             25600
1992  NEC SX-3/44         4           2.9                13420             22000
1991  Fujitsu VP2600/10   1           3.2                4009              5000
1990  Fujitsu VP2600/10   1           3.2                2919              5000
1989  CRAY Y-MP/832       8           6                  2144              2667
1988  CRAY Y-MP/832       8           6                  2144              2667
1987  NEC SX-2            1           6                  885               1300
1986  CRAY X-MP-4         4           9.5                713               840

 

These numbers come from the Linpack Benchmark Report Table 1.

(Full precision; matrix size 1000; best effort programming, maximum optimization permitted.)

 

Top Computers Over Time for the Highly-Parallel Linpack Benchmark

 

(Entries for this table began in 1991.)

Year       Computer                                     Processors  Measured Gflop/s  Size of Problem  Size of 1/2 Perf  Theoretical Peak Gflop/s
2005-2006  IBM Blue Gene/L                              131072      280600            1769471          -                 367001
2002-2004  Earth Simulator Computer, NEC                5104        35610             1041216          265408            40832
2001       ASCI White-Pacific, IBM SP Power 3           7424        7226              518096           179000            11136
2000       ASCI White-Pacific, IBM SP Power 3           7424        4938              430000           -                 11136
1999       ASCI Red Intel Pentium II Xeon core          9632        2379              362880           75400             3207
1998       ASCI Blue-Pacific SST, IBM SP 604E           5808        2144              431344           -                 3868
1997       Intel ASCI Option Red (200 MHz Pentium Pro)  9152        1338              235000           63000             1830
1996       Hitachi CP-PACS                              2048        368.2             103680           30720             614
1995       Intel Paragon XP/S MP                        6768        281.1             128600           25700             338
1994       Intel Paragon XP/S MP                        6768        281.1             128600           25700             338
1993       Fujitsu NWT                                  140         124.5             31920            11950             236
1992       NEC SX-3/44                                  4           20.0              6144             832               22
1991       Fujitsu VP2600/10                            1           4.0               1000             200               5

(A "-" marks an entry that is blank in the source table.)

 

These numbers come from the Linpack Benchmark Report Table 3.

(Full precision; the manufacturer is allowed to solve as large a problem as desired, maximum optimization permitted.)

Measured Gflop/s is the measured peak rate of execution for running the benchmark in billions of floating point operations per second.

Size of Problem is the matrix size at which the measured performance was observed.

Size of ½ Perf is the size of problem needed to achieve ½ the measured peak performance.

Theoretical Peak Gflop/s is the theoretical peak performance for the computer.

What is the HPC Challenge benchmark?

The HPC Challenge benchmark consists at this time of 7 benchmarks: HPL, STREAM, RandomAccess, PTRANS, FFTE, DGEMM and b_eff Latency/Bandwidth. HPL is the Linpack TPP benchmark; the test stresses the floating point performance of a system. STREAM is a benchmark that measures sustainable memory bandwidth (in GB/s). RandomAccess measures the rate of random updates of memory. PTRANS measures the rate of transfer for large arrays of data from the multiprocessor’s memory. Latency/Bandwidth measures (as the name suggests) latency and bandwidth of communication patterns of increasing complexity between as many nodes as is time-wise feasible.

Where can I get additional information on the HPC Challenge benchmark?

For additional information on the benchmark see: http://icl.cs.utk.edu/hpcc/  

Is there a benchmark for sparse matrices?

The Linpack Benchmark suite is built around software for dense matrix problems. In May 2000 we started to put together a benchmark for sparse iterative matrix problems. For additional information see: http://www.netlib.org/benchmark/sparsebench/

Where can I get additional information on benchmarks?

For additional information on benchmarks see: http://www.netlib.org/benchweb/

Where can I send comments?

Please send your comments to Jack Dongarra at dongarra@cs.utk.edu.