Title: Heavy Metal
Feature: #Grid
Date: 1 April 2008
Banks begin to pursue a third dimension in high-performance computing: using specialist accelerators in shared heterogeneous grids. By Bob Giffords
There's a new heavy metal sound reverberating down the corridors of banking IT departments since last year. Talk of hybrid grids and hardware accelerators is turning into action.
"We all started out with dedicated grids of Windows or Linux, for equities or fixed-income analytics," says Ryan Bagnulo, head of architecture and innovation at Wachovia Corporate and Investment Banking (CIB). The bank then consolidated this into a pool of high-end, low-power, multi-core Intel CPUs to shrink the footprint of the grid for greater efficiencies in power and cooling.
"However, we soon realized that heterogeneous grids comprised of both traditional blades and specialty appliances had real performance and economical advantages," he adds. So Wachovia now has pools of specialized resources dedicated to very different use cases, including Azul Systems engines for Java execution, IBM DataPower servers for accelerated enterprise service bus (ESB) integration and demilitarized zone (DMZ) firewalls and commodity Intel blades for much of the rest. "We might in the future add some blades with field-programmable gate array (FPGA) co-processors running algorithmic trading routines that execute on streaming market data with low-latency messaging capabilities," says Bagnulo.
Citi is also moving toward these hybrid solutions. "We see four quite separate roles for specialist accelerators," explains Dejan Polomcic, trading services architect in EMEA equities and prime finance technology at Citi. "Computational improvement for risk and analytics; low-latency trading, including both the market data piece and fast routing; low-latency messaging, including both connectivity and processing; and throughput-oriented large work streams, rich in data and volumes," he says. "Very different vendors are addressing each of these spaces."
Polomcic notes that other parts of the bank are already using accelerators. "In equities trading, we are currently working with a wide range of vendors and will probably be taking decisions later this year," he adds. "Key issues for us are software compatibility, ease of use for programmers and the sustainability of the vendors, which are often fairly small companies. The software stacks are definitely improving with more functionality and integration options."
Hardware vendors agree. "Accelerator technology has improved greatly over the past two years, both in terms of hardware compatibility and software support," says Jim Bovay, program manager for accelerators and multi-core at Hewlett-Packard. "But the various offerings are still evolving, in terms of power, cooling, form factors, connectivity and software."
The really glamorous accelerators are the high-speed floating-point engines for high-performance computing (HPC) applications. ClearSpeed Technology and IBM are both pioneering this vital sector.
THEATRICAL VIRTUOSITY
"Six years ago, we realized that clock speeds would inhibit future semiconductor growth and that specialist multi-core accelerators were the solution," explains Tom Beese, CEO at ClearSpeed. "Our first product had 96 cores, while the latest teraflop technology is the world's highest ever compute density, achieving a stunning 2.5 gigaflops per watt." He confirms that ClearSpeed's "terascale" system has just started shipping, with two customers taking delivery of teraflop performance in a 1U chassis just three months after the first public demonstration. "We're currently working with a small number of major banks in the UK, the US and Japan in proof-of-concept pilots," says Beese.
IBM's Cell Broadband Engine, originally designed for graphics acceleration, is now establishing its HPC credentials with the banks. "The Cell blade addresses two market requirements," says Derek Duerden, executive IT architect for financial markets and banking at IBM. "The just-in-case scenario of overnight analytics and scenario planning, and the just-in-time scenario for real-time, low-latency analytics and stream processing. Of course, with just-in-time, the trick is to not be just-too-late and waste the investment. Our customers see opportunities in both spaces."
Early tests of Cell showed significant improvements over conventional architectures, including 60-times speed-ups for Monte Carlo simulations. "Our latest Cell blade is now shipping to our pilot customers, optimized for double-precision floating point with 32 GB of memory on the blade. This should substantially improve the usability for our clients," says Duerden.
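Monte Carlo workloads suit these wide multi-core accelerators because every simulated path is independent and can run on a separate core. A toy European call pricer illustrates the shape of the computation (pure Python and hypothetical parameters, for illustration only — production analytics would vectorize this across the accelerator's cores):

```python
import math
import random

def mc_european_call(spot, strike, rate, vol, expiry, n_paths, seed=42):
    """Price a European call by simulating terminal prices under
    geometric Brownian motion and discounting the average payoff."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol ** 2) * expiry
    diffusion = vol * math.sqrt(expiry)
    payoff_sum = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)  # each path is independent -> trivially parallel
        terminal = spot * math.exp(drift + diffusion * z)
        payoff_sum += max(terminal - strike, 0.0)
    return math.exp(-rate * expiry) * payoff_sum / n_paths
```

The inner loop is where an accelerator earns its keep: the same few arithmetic operations repeated across millions of independent paths, with no branching between them.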
Both vendors are now focusing on the software ecosystem. ClearSpeed has opted for software integration with third-party applications like Wolfram Research's Mathematica, The MathWorks' Matlab and the Numerical Algorithms Group (NAG) libraries, but also provides support for in-house programmers.
"Software support for the Cell BE blade is improving rapidly," says Duerden. "Platform Computing, for example, has just announced an upgrade of its grid engine, Symphony, which can schedule work to all 18 cores of the blade. Combined with IBM's own financial libraries, they have shown elapsed time savings of up to 80 percent over conventional architectures on portfolio analytics and risk."
Besides libraries, compilers and Microsoft Excel integration technology, IBM is also upgrading its WebSphere product line. "Our end-to-end solution includes WebSphere MQ Low Latency Messaging on InfiniBand, conventional blades and IBM's Cell Broadband Engine blades for compute-intensive analytics," says Folu Okunseinde, solutions architect for WebSphere front-office and low-latency messaging at IBM. "While InfiniBand is a network transport fabric, it really works more like shared memory, reaching down to the individual cores or processing elements on the chip." Clearly, complex HPC accelerators need these transport and memory features to avoid data starvation or saturation.
Okunseinde sees great opportunities. "If you opt for co-location, the latency differences become so minuscule that they are, for practical purposes, the same. Competitive advantage then falls to the complexity of the algorithms, which is where compute grids and the Cell BE blades come in with their high-performance, low-power output," he says.
A HEAVY BEAT
Low-latency trading is driving much of the demand for accelerators to meet the exploding challenge of market data. "Volumes in the US only took off in 2000 when equity market competition started in earnest and now number in the hundreds of thousands of messages per second," says Jeff Wells, vice president of product management at hardware ticker plant vendor Exegy. "If the Markets in Financial Instruments Directive (MiFID) is to be a success in Europe, we expect to see huge growth in volumes. Eurex, for example, is now a few tens of thousands [of messages] per second."
So Exegy developed a special FPGA-based accelerator for the ticker plant market using AMD dual and quad-core technology. "We support all of the top European as well as US exchange feeds," says Wells, noting that the vendor's FPGA accelerator is in production in both the US and Europe, while some other firms are at the evaluation stage. Besides data feed management, including watch lists, the appliance also offers a basket calculation engine for exchange-traded funds (ETFs), which automatically updates the basket based on prices for the underlying stocks.
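The basket calculation such an engine performs is conceptually simple — a weighted sum over the constituents, recomputed every time an underlying ticks; the appliance's value is doing it in hardware at feed rates. A minimal sketch of the logic (hypothetical symbols and weights):

```python
def basket_value(weights, prices):
    """Indicative basket value: shares-per-creation-unit times latest price,
    summed over all constituents."""
    return sum(weights[symbol] * prices[symbol] for symbol in weights)

def on_tick(weights, prices, symbol, new_price):
    """Apply one underlying price update and return the refreshed basket value."""
    prices[symbol] = new_price
    return basket_value(weights, prices)
```

A software implementation recomputes this sum on every tick; an FPGA pipeline can instead update only the changed term, keeping latency flat as constituent counts grow.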
The Exegy accelerator has already been benchmarked by the Securities Technology Analysis Center (STAC) in Chicago. The results are confirmed by Peter Lankford, director at STAC: "Our report on the FPGA-enabled Exegy Ticker Plant with InfiniBand showed that the full Options Price Reporting Authority (OPRA) feed at 1 million messages per second could be processed with an average latency of 80 microseconds and a 99th percentile latency of 150 microseconds."
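Figures like an 80-microsecond mean and a 150-microsecond 99th percentile come from raw per-message timing samples; the nearest-rank percentile method is a common way to compute them. A small stdlib-only sketch (hypothetical sample data, not STAC's methodology):

```python
import math

def latency_stats(samples_us):
    """Return (mean, 99th-percentile) latency from a list of
    microsecond samples, using the nearest-rank percentile method."""
    ordered = sorted(samples_us)
    mean = sum(ordered) / len(ordered)
    # nearest rank: smallest sample with at least 99% of values at or below it
    rank = max(math.ceil(0.99 * len(ordered)), 1) - 1
    return mean, ordered[rank]
```

The gap between the mean and the tail percentile is the interesting number for trading systems: a low average with a fat tail still means missed quotes under bursty load.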
Wells at Exegy says that a new model, coming out later this year, is "much, much faster."
A second supplier keeping up the FPGA beat is Activ Financial, based in Cambridge, UK. Internal tests of the vendor's ticker plant gave similar results to those of STAC. "Against a full live OPRA feed, we benchmark latencies averaging around 75 microseconds," says James Bomer, head of new technologies development at Activ Financial. "This is a full 10- to 20-fold advantage over current software systems, and with our new native InfiniBand Verbs and 1GbE/10GbE iWARP support we will be even faster." The Activ Financial appliance is also in trials with its clients. "Besides ultra-low latency, many clients are evaluating it as a serious way to reduce their hardware footprint," says Bomer. "This is the year we expect to see major adoption of this technology."
Pure software systems can be scaled sideways, but eventually the cost becomes prohibitive. "By using FPGA acceleration we actually see processing headroom outpacing data rate growth for the first time in years," says Bomer. "It gives us and our clients a clear, commercially sound technology roadmap."
Both Exegy and Activ Financial provide external application programming interfaces (APIs) to avoid the user programming challenges of FPGA.
AMPLIFYING SOUND
While the floating point and ticker plant accelerators are still largely in customer trials, one accelerator has achieved real market traction.
Bagnulo at Wachovia speaks from experience: "Azul has revolutionized Java processing by eliminating the garbage collection pause, which could last for seconds, thereby removing the 2 GB memory limits on Java Virtual Machines (JVMs) running on Intel. The largest Azul servers currently have 768 cores and as much as three-quarters of a terabyte of memory on a single appliance, and multiple appliances can be clustered together. So there are in effect fewer CPU and memory limitations, and we can get the advantages of both fast time to market and real-time trading performance."
Azul is designed to support transactional Java applications. "Examples of major deployments in financial services today include trading platforms, risk engines, real-time order matching and messaging gateways," says Syed Rizvi, managing director at Azul Systems Europe. "We take a mixed hardware-software approach to optimize performance and scalability of JVMs. Later we might consider further virtual machine support for .Net or other VM-based environments as they move to large-scale deployment."
Since launching its Java accelerator two-and-a-half years ago, Azul Systems has grown quickly. "Compared to general purpose servers, Azul can handle typically five times the throughput at roughly half the latency and only perhaps 20 percent of the datacenter footprint in terms of power, cooling and space," says Rizvi.
This scalability and performance are achieved through high levels of multi-threading and pauseless, hardware-assisted garbage collection. Even hundreds of threads are apparently not uncommon. "The architecture is built up from 48-core chips sharing a common memory in a flat, symmetric multiprocessing (SMP) configuration, with contention minimized by virtue of hardware-backed optimistic locking support," says Rizvi. "Consequently, Java applications run straight out of the box without modification."
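The optimistic locking Rizvi describes can be pictured in software as a compare-and-set retry loop: read without locking, compute, and commit only if nothing changed in the meantime. Azul does this transparently in hardware for Java monitors; the sketch below merely illustrates the pattern, using a Python lock to stand in for an atomic compare-and-set:

```python
import threading

class OptimisticCounter:
    """Optimistic update: read freely, commit via compare-and-set, retry on conflict."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # stand-in for a hardware CAS instruction

    def _compare_and_set(self, expected, new):
        """Atomically set the value, but only if it still equals `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def increment(self):
        while True:                    # optimistic retry loop
            seen = self._value         # unlocked read
            if self._compare_and_set(seen, seen + 1):
                return                 # no conflict -> done; else another thread won
```

Under low contention, almost every attempt succeeds first time, so threads rarely block — which is why the approach scales across hundreds of cores.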
At Wachovia, Azul devices are cleverly integrated into the grid. "We run most of our new implementations of JBoss and WebLogic on a hybrid infrastructure that is part Azul appliance and part Intel blade via Azul's JNI proxy technology," explains Bagnulo. "Occasionally, there is even a legacy integration from the Java code to older platforms like an AS/400 using a Program Call Markup Language (PCML) bridge to execute COBOL code using data passed from a Java-based Web user interface. This resolved performance problems we had for one of our business applications when the Web application server also ran on the AS/400."
ORCHESTRATION IS KEY
Getting it all together is the challenge. "It's very risky to put all of your eggs in one basket, so the practice of abstraction, virtualization and service-oriented architecture (SOA) has caused IT executives to move away from stovepipes of dedicated servers for each application," says Bagnulo at Wachovia. "We now provision pools of specialized resources that have sufficient capacity to be partitioned across multiple traditional business applications and distributed IT." So the bank uses Azul for Java and Intel blades for C++ dynamic-link libraries (DLLs) that execute heavy analytical functions.
However, there are still many challenges. "The benefits are attractive but there is a huge entry hurdle in terms of the software piece," says Polomcic. "You have to do a lot of training and software conversion of legacy libraries, analytics, risk packages and quant models. You have to be pretty sure of sustainable benefits, notwithstanding the continuing improvements in standard Intel and AMD chips."
"We still see multi-core as the main thrust, since we shall soon move from quad cores to octals or hexadecacores using both scale-out and scale-up strategies," says Bovay at Hewlett-Packard, who is working with many of these vendors. "However, we now see a third dimension to boosting power using specialist accelerators in shared heterogeneous grids." So banks should now start to listen to the heavy beat of this important third dimension in grid technology.
Bob Giffords is an independent banking and technology analyst and can be reached at .
© Incisive Media Ltd. 2009