Most IT organizations and CIOs focus their resources on the development of business applications and end-user support. There is still a relatively untapped opportunity in providing research computing capabilities (“Grids”) as a core IT service, even if your users have not explicitly articulated a need. In this article I will discuss how the deployment of an internal Grid led to unanticipated, serendipitous benefits and became a critical asset of Columbia Business School, and outline some steps that the CIO should consider when deploying this technology.
Research computing Grids differ from typical business computing deployments in their parallel processing architecture, fast interconnect speeds, and specialized software tools, which enable users to run workloads ranging from pattern recognition to advanced statistical analysis and regression. Jobs on the Grid can take anywhere from seconds to weeks to run, depending upon the complexity of the analysis and the volume of the data.
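Most of these workloads follow an embarrassingly parallel pattern: one analysis is split into many independent jobs that run side by side. As a minimal sketch of that pattern (the `run_job` function and job count are hypothetical stand-ins, not a real scheduler API):

```python
from multiprocessing import Pool

def run_job(seed):
    """One independent unit of work -- a placeholder for a real task,
    such as a regression fitted on one slice of the data."""
    total = 0
    for i in range(1, 1_000):
        total += (seed * i) % 7  # toy arithmetic standing in for real work
    return seed, total

if __name__ == "__main__":
    # On a real Grid a scheduler fans these jobs out across many nodes;
    # locally, a process pool mimics the same fan-out/gather pattern.
    with Pool(processes=4) as pool:
        results = pool.map(run_job, range(20))
    print(f"completed {len(results)} independent jobs")
```

Because each job is independent, throughput scales almost linearly with the number of cores the Grid makes available, which is why job counts can climb so quickly once users adopt the platform.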
"Grids are appealing because of their unique architecture, which is geared for high performance"
When I started at Columbia Business School in 2012, there was a small Grid that was lightly used, running fewer than 20,000 jobs per year. In 2016 the Grid ran over 1,000,000 jobs, and in 2017 we have run over 10,000,000. This growth illustrates that the Grid has become a valuable tool that the faculty have embraced. The Grid has supported their research and publications, and is now a differentiating enticement for recruiting new faculty.
Deploying a Grid is more than linking a large quantity of CPUs and storage together. A successful deployment should include these key steps:
A Governance Model
They say King Arthur had a “round table” so that no man would sit at its head. CIOs frequently make a critical error by establishing a hierarchical IT management model in which customers do not have an adequate voice in articulating the services they need. An open and transparent partnership with all stakeholders is particularly important since the Grid will be a shared platform. We have a governance committee composed of representatives of each faculty division at the school, along with myself and our key IT engineers. The group meets bimonthly and seeks transparent consensus on everything from where our infrastructure investments should be focused to which software tools should be purchased and how they are configured on the Grid.
Cloud vs. Internal
One of the first decisions that the CIO should make is whether the organization should build the Grid internally or use Cloud services such as Amazon’s EC2. Several factors come into play, but the bottom line is almost always cost. If usage is going to be minimal, or if there are insufficient internal resources with the expertise to configure and maintain an internal Grid, a Cloud solution might be the best option. When usage becomes significant, an internal solution is usually best. As Cloud providers continually enhance their offerings and lower their prices, the break-even point between Cloud and internal solutions will shift in favor of the Cloud. Also, if you have doubts about the future use of your Grid, it might be best to start with a Cloud-based solution in order to minimize your organization’s investment and risk.
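The cost question can be made concrete with a simple break-even calculation. All of the figures below are hypothetical assumptions for illustration, not actual Cloud or hardware prices:

```python
def breakeven_core_hours(capex, years, annual_opex, cloud_rate):
    """Annual core-hours of usage above which an internal Grid becomes
    cheaper than renting the same capacity from a Cloud provider.
    All inputs are illustrative assumptions."""
    # Amortize the hardware purchase over its useful life, then add
    # recurring costs (power, space, staff).
    internal_annual_cost = capex / years + annual_opex
    # Cloud cost scales linearly with usage; internal cost is flat.
    return internal_annual_cost / cloud_rate

# Hypothetical: $200k of hardware amortized over 4 years, $60k/year
# to operate, versus $0.05 per core-hour in the Cloud.
threshold = breakeven_core_hours(200_000, 4, 60_000, 0.05)
print(f"internal wins above {threshold:,.0f} core-hours per year")
```

Below the threshold, pay-as-you-go Cloud pricing wins; above it, the flat cost of internal hardware is spread across enough usage to be the cheaper option, which matches the "usage becomes significant" rule of thumb above.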
User Fee Structure
After deciding on a Cloud or internal solution, one of the critical decisions a CIO will need to make is whether the IT organization should make the entire Grid available to its users for free, or make a portion of the Grid available for free with a fee for certain premium services (e.g., storage or CPU use above a certain threshold). At Columbia Business School we took the position that the Grid would be a free resource funded entirely by the school’s overhead budget. This model resulted in a 50,000 percent increase in usage by the faculty over the course of five years.
I have worked at other institutions where this model would not make sense, since certain users would monopolize all of the Grid’s resources. In those cases, the CIO can charge a premium for excess usage and guarantee the user a certain minimum amount of internal resources (CPU, storage, throughput, etc.). The latter approach works particularly well when faculty members and researchers receive grants or external funding. A portion of the grant can be invested in purchasing additional resources for the Grid, and the user is allocated a reserved minimum of the Grid’s resources (thus demonstrating appropriate stewardship, or even allowing physical assets to be tagged for the grant funding agencies). In return, the researcher gets more capacity than the grant money alone could ever have purchased, while the incremental resources can be shared with other users whenever they are not needed to meet the allocated minimum.
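The grant-funded model described above can be sketched as a simple allocation policy. The quota, reserved minimum, and rate below are hypothetical placeholders, not our actual figures:

```python
def monthly_allocation(core_hours_used, free_quota, reserved_minimum,
                       premium_rate):
    """Hedged sketch of a premium-above-threshold fee model: everyone
    gets `free_quota` core-hours at no cost, grant-funded users also
    hold `reserved_minimum` guaranteed core-hours (shared with others
    when idle), and usage beyond the free quota is billed."""
    billable = max(0, core_hours_used - free_quota)
    return {
        "guaranteed_core_hours": reserved_minimum,
        "billable_core_hours": billable,
        "charge_usd": billable * premium_rate,
    }

# Hypothetical grant-funded user: 25,000 core-hours consumed against a
# 10,000 core-hour free quota, billed at $0.03 per core-hour.
bill = monthly_allocation(25_000, free_quota=10_000,
                          reserved_minimum=5_000, premium_rate=0.03)
print(bill)
```

The returned record doubles as the stewardship trail: it shows the funding agency both the guaranteed allocation the grant purchased and the usage actually billed.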
Creating an Appropriate Architecture and Operation
Grids are appealing because of their unique architecture, which is geared for high performance. Every aspect of the Grid’s design needs to take this into account. As an example, I’ve toured many supercomputing facilities and noticed that there are no vertical support beams in the datacenter floor area. If a beam were present, interconnect cables might have to be snaked around it, adding several inches to their length and delaying data transmission by roughly a nanosecond per traversal. When dealing with peta-scale I/O transactions, even that reduction in throughput is impactful. Another architectural option is GPU computing (Graphics Processing Units): by offloading arithmetically intensive problems to what is essentially a low-cost gaming graphics board containing thousands of cores, a 100x increase in processing speed is possible for certain applications running GPU-enabled software. There are literally hundreds of factors that need to be taken into consideration during the design phase, so bring in experts who have done this before.
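The cabling claim is easy to sanity-check with back-of-the-envelope arithmetic. Assuming signals propagate at roughly 70 percent of the speed of light (a typical figure for copper and fiber; the six-inch detour is also an assumption), a few extra inches of cable add delay on the order of a nanosecond:

```python
# Sanity check of the cable-length claim; all inputs are assumptions.
SPEED_OF_LIGHT = 299_792_458   # meters/second, exact by definition
PROPAGATION_FACTOR = 0.7       # assumed: signals travel at ~70% of c

extra_inches = 6               # assumed detour around a support beam
extra_meters = extra_inches * 0.0254

delay_seconds = extra_meters / (PROPAGATION_FACTOR * SPEED_OF_LIGHT)
print(f"extra delay: {delay_seconds * 1e9:.2f} nanoseconds per traversal")
```

Under a nanosecond sounds negligible, but it compounds across the billions of transfers involved in peta-scale I/O.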
Probably the most significant factor in the success of our Grid is our support personnel. I have one person dedicating 100 percent of his time to supporting the Grid and its users. This individual acts not only as a troubleshooter, but also as a trusted research partner to our faculty and students. He is the “chief evangelist,” articulating to users how this technology can solve the research problems they deal with each day.
In summary, a Grid is a unique tool that faculty can apply to a variety of research problems, often in ways that you cannot even imagine. Setting up the Grid’s architecture, governance, fee structure, and support model will likely determine its eventual adoption and success in your environment. At Columbia Business School, adoption and use of the Grid has increased exponentially over the past five years, and it is now a positive differentiating asset, generating data and analysis that have been incorporated in numerous research projects and published papers.