Technologists at JPMorgan devise a desktop grid solution for the derivatives group that reuses idle resources, can potentially scale to thousands of machines, and cuts costs.
"It is all about lowering costs, not just more power," says Francis Verdier, chief business technologist for rates and commodities in the exotics and hybrids group at JPMorgan. Yet the constant product innovation and customized solutions that have made the investment bank one of the consistent leaders in global derivatives rely on a relentless search for power. Behind the scenes, JPMorgan's technologists have found yet another rabbit to pull out of the hat. In this case, the rabbit is the "free" power of 1,000 scavenged PCs and expected cost savings of $1 million a year. The hat is the world-class, object-oriented risk and pricing system known as "Kapital." It is a case study that illustrates how systems evolve, adapting to changing requirements. It also shows how technology can help to address the desperate shortage of derivatives knowledge by extending the lives of systems.
The Birth of Kapital
The story begins in the early 1990s, when JPMorgan first developed Kapital using the object-oriented language Smalltalk and a compatible database from GemStone Systems. "While the basics can be picked up in a couple of weeks, the system is rich and multifunctional," says Verdier. "And users really like its flexibility. The ease with which we can reuse components to model complex instruments ensures a rapid time to market."
Using Smalltalk enables developers to focus on the business problems, so technologists can later adopt different strategies for component implementation without changing the remaining source code, he says. Such flexibility and virtualization are essential, since the business constantly needs to innovate: pricing and risk models are increasingly sophisticated, and therefore compute-intensive, and some existing products will need to be processed for decades.
Enter JPMorgan's Compute Backbone (CBB), the investment bank's grid computing infrastructure spearheaded by Adrian Kunzle, CTO of the investment bank technology architecture group; Ty Panagoplos, CBB program manager; and Mike Ryan, chief technologist for the CBB. "Originally, Kapital ran on big iron boxes," recalls Ryan. "But the economics of that deployment don't scale. So they have moved onto CBB commodity Linux nodes in the hunt for cheaper compute, and now occupy over 5,000 of our CPUs on that platform. And they continue to grow at a dramatic pace."
Kapital has long innovated to meet the ever-increasing compute demand, introducing a PC farm in 2002. However, that implementation would not scale beyond 350 machines. The CBB team offered a desktop grid solution that could potentially scale to thousands of machines. "The limit will be the number of PCs we have available, not the software to manage them," says Ryan.
"The real bottom line for the PC farm is the free compute by reusing idle resources," says Verdier. "We will save $1 million per year." Nevertheless, many see the proposal as quite radical. Banks are wary of using scavenging for production and tend to use desktops only for development and user acceptance testing.
Making the Case
JPMorgan was no exception; getting all parties to buy in to a PC farm performing mission-critical work presented many challenges. So the CBB team created a test farm in the summer of 2006, and re-engineering of Kapital began in earnest. The integration was completed within six months, followed by a few months of parallel production. In May 2007, the first stage of live migration took place, transferring simpler foreign exchange and interest rate hybrids off Kapital blades and onto the CBB PC farm.
Ryan's team accomplished this by creating a production cluster of 1,000 Microsoft Windows PCs in London, with 1,500 more waiting in the wings. "Of course, at any one time, some PCs will be in use locally, and others may be switched off or dead," says Ryan. "An individual machine could be unavailable for any number of reasons, but overall the number available is statistically reliable. That means we can provide a service level agreement to the business on the collective computing power they can expect to have at their disposal."
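Ryan's point — that any individual PC is unreliable but the aggregate is statistically dependable — can be illustrated with a toy binomial model. The function below answers: given a fleet of a certain size where each machine is independently available with some probability, how much collective capacity can be promised at a given confidence level? The 70 percent availability figure in the usage example is an illustrative assumption, not a JPMorgan number.

```python
import math

def sla_capacity(total_pcs, p_avail, confidence=0.99):
    """Largest machine count k such that P(at least k PCs available) >= confidence,
    modeling each PC as independently available with probability p_avail."""
    # Exact binomial probability mass function P(X = i)
    pmf = [math.comb(total_pcs, i) * p_avail**i * (1 - p_avail)**(total_pcs - i)
           for i in range(total_pcs + 1)]
    # Walk down from the top, accumulating the upper tail P(X >= k)
    tail = 0.0
    for k in range(total_pcs, -1, -1):
        tail += pmf[k]
        if tail >= confidence:
            return k
    return 0

# With 1,000 PCs each available 70% of the time (illustrative figure),
# roughly two-thirds of the fleet can be promised with 99% confidence.
guaranteed = sla_capacity(1000, 0.7, confidence=0.99)
```

The individual machine tells you almost nothing; the fleet-level number barely moves, which is what makes a service level agreement on collective compute possible.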
The University of Wisconsin's open-source Condor software framework is used to control the action on the PCs, much like a screen saver. After five minutes of idle time where there is no keyboard or mouse activity, Condor registers the machine as available to the grid, updates the service code to obtain the proper set of analytics libraries, and begins pulling down tasks for execution. If a user returns to the machine while it's performing grid work, Condor immediately shuts down to avoid collateral impact. "It's been rock solid," says Ryan, pointing to the absence of Condor-related help desk calls.
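The screen-saver-like behavior described above amounts to a small state machine: idle for five minutes means join the grid and pull work; any keyboard or mouse activity means vacate immediately. A minimal sketch follows — class and method names are illustrative, not Condor's actual API, which hooks into OS input events rather than explicit calls.

```python
IDLE_THRESHOLD = 5 * 60  # seconds of no keyboard/mouse activity

class ScavengerAgent:
    """Toy model of screen-saver-style cycle scavenging (illustrative,
    not Condor's real interface)."""

    def __init__(self):
        self.state = "LOCAL_USE"
        self.last_input = 0.0

    def on_input(self, now):
        """Called on any keyboard/mouse event."""
        self.last_input = now
        if self.state == "GRID_WORK":
            self.state = "LOCAL_USE"  # user returned: vacate immediately
            return "preempt_task"     # running grid task is abandoned/requeued
        return None

    def tick(self, now):
        """Called periodically to check for the idle threshold."""
        if self.state == "LOCAL_USE" and now - self.last_input >= IDLE_THRESHOLD:
            self.state = "GRID_WORK"  # register with grid, pull down tasks
            return "fetch_tasks"
        return None
```

Because a preempted task is simply requeued elsewhere, the local user never waits on grid work — which is why the scheme generates no help desk calls.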
"Kapital continues to run on the Linux CBB nodes with its own distribution software, known as PCS, required because of the data-intensive strategy originally adopted," says Verdier. PCS optimizes portfolio calculations, for example, by grouping by exposures. "It's extremely smart, though not fully automatic," he says. "It takes some manual intervention to assign portfolio components to nodes." Verdier also points to the initial heavy traffic between the GemStone database and the compute nodes as the system warms up, which puts stress on the Gigabit Ethernet connection linking the servers. The plan is to migrate more of the business onto true stateless computing on the standard CBB platform, using dedicated Linux x86 hosts with InfiniBand interconnect, and also the scavenged PCs.
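The idea behind grouping by exposures is that trades sharing the same market data should land on the same node, so the curve or surface is loaded once rather than per trade. A minimal sketch — the data shapes and field names here are illustrative, not PCS's actual model:

```python
from collections import defaultdict

def group_by_exposure(trades):
    """Bucket trades by the market data (exposure) they depend on, so one
    compute node loads each curve/surface once and prices the whole bucket.
    `trades` is a list of (trade_id, exposure_key) pairs -- a simplification
    of the real portfolio structure."""
    buckets = defaultdict(list)
    for trade_id, exposure_key in trades:
        buckets[exposure_key].append(trade_id)
    return dict(buckets)

# Two USD trades share one bucket; the EUR trade gets its own.
buckets = group_by_exposure([("t1", "USD-curve"), ("t2", "EUR-curve"),
                             ("t3", "USD-curve")])
```

The payoff is reduced traffic between the database and the compute nodes — exactly the warm-up bottleneck Verdier describes.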
Once Kapital is fully established on CBB, scheduling of compute resources will be completely automated. "Users will give us quality of service parameters to use as hints to the scheduler," says Ryan. "If latency is key, then jobs will run on blades in the local datacenter. For less time-critical work, we'll use the PCs or available resource across the wide area network." JPMorgan's exotics and hybrids team analyzes the historical run times of jobs to generate these parameters for the scheduler.
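The quality-of-service hints Ryan describes reduce to a placement decision: latency-critical jobs stay on local blades, while long-running or queue-backed work spills onto scavenged PCs or wide-area resources. The rule below is a toy sketch; the thresholds and parameter names are invented for illustration, not taken from the CBB scheduler.

```python
def place_job(latency_critical, est_runtime_s, blade_queue_depth):
    """Toy placement rule driven by QoS hints (thresholds are illustrative).
    est_runtime_s would come from the historical run-time analysis the
    exotics and hybrids team performs."""
    if latency_critical:
        return "local_blades"          # latency is key: local datacenter
    if est_runtime_s > 600 or blade_queue_depth > 10:
        return "scavenged_pcs"         # long or queue-backed: cheap cycles
    return "local_blades"              # short job, blades free anyway
```

Historical run times supply the estimates, so users state intent ("latency is key") rather than hand-picking machines.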
"Our approach to migration is to leverage the quants' effort to recode the analytics into stateless C/C++ libraries running on the backbone and replace the related Smalltalk code with calls to the library," says Verdier. In this way, the Smalltalk code can ignore the implementation issues of grid entirely. A small PCS installation of a few hundred CPUs will remain to support the Smalltalk GemStone database and those processes that are closely linked to it. "This will maintain the flexibility that developers have found so valuable," says Verdier. "It is the best of both worlds."
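The "stateless" property Verdier relies on means every input travels with the call, so any grid node can serve any request without touching the shared Smalltalk/GemStone image. The toy pricer below illustrates only that calling convention — the function, its parameters, and the simplified swap math are all invented for illustration and bear no relation to Kapital's actual analytics.

```python
def price_swap(notional, fixed_rate, discount_factors):
    """Stateless toy pricer: PV of a fixed leg minus a par floating leg,
    with unit accrual periods. All market data arrives as arguments --
    nothing is read from shared in-memory state, so any node can run it."""
    fixed_leg = sum(notional * fixed_rate * df for df in discount_factors)
    floating_leg = notional * (1.0 - discount_factors[-1])
    return fixed_leg - floating_leg

# The caller (in the real system, Smalltalk code calling out to the C/C++
# library) ships everything the computation needs in one request.
pv = price_swap(1_000_000, 0.05, [0.98, 0.95])
```

Because the call carries its full context, the Smalltalk side can stay ignorant of where — blade, scavenged PC, or WAN resource — the work actually runs.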
A New Chapter
Clearly, there are many challenges ahead, but the successful use of PC scavenging for production work has opened a new chapter in the history of compute grids at JPMorgan. "We have had sharing on the CBB for well over a year, but it was allocated in large time chunks," notes Ryan. "The techniques developed to scavenge cycles from PCs will also work on under-utilized datacenter blades, allowing us to be a lot more efficient with those machines."
With demand expected to grow enormously and datacenters reaching their physical limits, these teams need to keep finding new ways to stretch the budget dollars. "We want the business to grow and innovate, staying ahead of the competition, so we use technology as the weapon," says Verdier.
"Kapital has just kept reinventing itself over 13 years," Ryan says. "The combination of Kapital and the CBB will continue to be a very cost-effective solution for the foreseeable future."
Bob Giffords is an independent banking and technology analyst and can be reached at .