Many organisations, including those in the HE/FE sector, are finding that storage is growing at an alarming rate and that, combined with a trend to require more servers to support that storage, this is making storage management unmanageable. The growth of distributed systems is also causing concern in many organisations, as standards of support in a devolved environment are not always adequate. Consequently, consolidation of both servers and storage is looking very attractive.
1.1 Defining the Storage Technology: DAS, SAN, NAS
1.2 The Fabric – Switches, Fibre Channel, iSCSI technologies
1.3 Disk Technologies
1.4 Storage Arrays
2.0 Centralised Backup Systems
3.0 Strategic Fit and Industry Positioning
4.0 Data Growth Rates and Their Management
5.0 Storage Management
5.1 SAN Management
5.2 Storage Virtualisation
5.3 Storage Resource Management
5.4 SMIS/Bluefin Storage Management Initiatives
6.0 Data Categorisation Strategy
7.0 Fit of a SAN into a Data Categorisation and DR Strategy
8.0 E-Science/Grid Support
9.1 Reduced hardware capital costs
9.2 Reduced effort to manage storage
9.3 Increased productivity through improved fault tolerance and DR capability
9.4 24x7 Availability
9.5 More efficient backup
9.6 Scalable Storage
9.7 Interoperability between diverse systems
9.8 Centralised Management
10.0 Justification for SANs – Writing the Business Case
11.0 Risks/Issues
Networked storage solutions (of which SANs and NAS are examples – see below) can offer increased flexibility for connecting storage, ensuring much greater utilisation of disk storage space and support for server consolidation (as storage and server capacity growth trends are no longer linked).
Installing a SAN is a large and complicated undertaking that needs institutional management commitment, and it is more suited to environments where a large proportion of the institution’s data will reside on the SAN. NAS can provide “plug and go” solutions for file serving, but SANs are better able to support large corporate databases and provide enhanced resilience.
1.0 The Technology
1.1 Defining the Storage Technology: DAS, SAN, NAS
Traditionally, data storage resides on hard disks that are locally attached to individual servers. This is known as Direct Attached Storage (DAS). Although this storage may now be large (in the order of 100s of Gigabytes of data storage per server) the storage is generally only accessible from the server to which it is attached. As such, much of this disk space remains unused and plenty of ‘contingency’ has to be built into storage needs when determining server specification. In addition, if the server were to fail, access to the data held on those local disks is generally lost.
A Storage Area Network (SAN) is a separate “network” dedicated to storage devices and at minimum consists of one or more large banks of disks mounted in racks that provide ‘shared’ storage space which is accessible by many servers/systems. Other devices, such as robotic tape libraries, may be attached to the SAN. See Figure 1 for a representation of both DAS and SAN storage.
Network Attached Storage (NAS) is storage that sits on the ordinary network (or LAN) and is accessible by devices (servers and workstations) attached to that LAN. NAS devices provide access to file systems and as such are effectively file server appliances. Delivery of file systems is most commonly via NFS (Network File System) or CIFS (Common Internet File System) protocols, but others may be used e.g. NCP (NetWare Core Protocol). These file systems require some sort of associated authentication system to check permissions for file access.
A SAN functions as a high-speed network similar to a conventional local area network (LAN) and establishes a direct connection between storage resources and the file server infrastructure. The SAN effectively acts as an “extended storage bus” using the same networking elements of a LAN including routers, bridges, hubs and switches. Thus, servers and storage can be ‘de-coupled’ allowing the storage disks to be located away from their host servers. The SAN is effectively transparent to the server operating system, which “sees” the SAN attached disks as if they were local SCSI disks. Figure 1 also shows the attachment of storage arrays and tape libraries via switches.
A dedicated SAN carries only “storage data”. This data can be shared with multiple servers without being subject to the bandwidth constraints of the “normal network” (LAN). In practice, a SAN allows data to be managed centrally and storage “chunks” to be assigned to host systems as required.
The main benefit of NAS devices is ease of deployment - most devices offering a “plug and play” capability, being designed as single purpose “appliances”. Modern NAS appliances can also serve large amounts of data with internal storage capacity measured in Terabytes. Some NAS appliances are limited by the authentication schemes supported and NetWare users in particular should seek clarification from vendors over compatibility issues.
Many regard SAN and NAS as competitors, but in reality they are complementary technologies – SAN delivering effective block-based input/output, whilst NAS excels at file based input/output (usually via NFS or CIFS). A hybrid device called a NAS Head or a NAS Gateway has storage that resides in the storage arrays attached to a SAN whilst still delivering file systems over the LAN. A combination of a SAN with NAS Gateways may be an effective way for sites to deliver file-based functionality e.g. for user home directories.
In fact, DAS still has an ongoing use for many purposes – the cost of connecting servers to the SAN can be high and for systems like DNS servers, for example, where redundancy is provided by other means (multiple equivalent servers), then highly critical data can be resident on the direct attached disks of the servers.
The world of storage is rapidly changing and interested parties are advised to keep monitoring useful storage-related web sites [1].
1.2 The Fabric – Switches, Fibre Channel, iSCSI technologies
The fabric for a SAN provides the connectivity between the host servers and the storage devices. The dominant architecture for SANs is based on Fibre Channel (FC) [2], which, whilst expensive, does have advantages in terms of its connectivity options. Compared to SCSI devices, for example, many more storage devices may be connected over much larger distances with higher data transfer rates.
Cables to connect SAN components are of three types: copper, short-wave fibre or long-wave fibre. Copper cables are only suitable for short connections (less than 12m), whilst short-wave fibre (multi-mode 50 micron) is used for distances up to 500m, with long-wave fibre (single-mode 9 micron) needed for longer distances up to 10km [3]. A Fibre Channel transceiver unit called a GBIC (Gigabit Interface Converter) is then needed to connect the FC cables to the FC devices. Two types of GBIC are available: short-wave (for distances up to 500m) or long-wave (for distances up to 10km) [4].
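The distance rules above can be captured in a small lookup. The following Python sketch is illustrative only: the function name `choose_fc_cable` is hypothetical, and the thresholds are simply the figures quoted in the text, not values taken from a standard.

```python
def choose_fc_cable(distance_m: float) -> str:
    """Suggest a cable type using the distance limits quoted in the text."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    if distance_m < 12:
        # Copper is quoted as suitable only for short connections (< 12 m).
        return "copper"
    if distance_m <= 500:
        return "short-wave fibre (multi-mode 50 micron)"
    if distance_m <= 10_000:
        return "long-wave fibre (single-mode 9 micron)"
    raise ValueError("beyond the 10 km limit quoted for long-wave fibre")
```

For example, `choose_fc_cable(300)` selects short-wave fibre, which in turn implies short-wave GBICs at both ends of the link.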
In Fibre Channel topologies the host server may be connected via a Host Bus Adapter (HBA) to the storage directly, via a hub or a switch. Direct connections to storage do not constitute a “network” and so are not used in SANs. Hubs use a topology known as Fibre Channel Arbitrated Loop (FC-AL) that shares the loop’s total bandwidth amongst all attached FC devices with a total device restriction of 126 attached FC devices. Switches provide a set of multiple dedicated, fully-interconnected, non-blocking data paths known as a “fabric”.
Switches provide simultaneous routing of device traffic and are capable of supporting a theoretical maximum of 16 million FC devices. SANs of any degree of complexity should therefore be based on a switched fabric using Fibre Channel switches with varying numbers of ports – typically 8, 16, 24 or 32 ports. Switches with a large number of ports (typically 64 or above) are also available with additional fault-tolerant features and are known as “directors”. Director-class switches are very expensive, but are the best solution for large scale SANs that will have a large number of host servers attached. Switches can be cascaded and linked together, but the available port count can soon be diminished by the requirement for Inter-Switch Links (ISLs). For a large SAN an ideal solution would be large directors at the centre of the fabric with smaller switches connected off the director via ISLs – the so-called “core-to-edge” solution [5].
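To see how quickly ISLs eat into the port count, the arithmetic can be sketched as below. This is a rough planning aid under stated assumptions: a single core director, every edge switch linked only to that core, and each ISL consuming one port at each end. The function name and the example switch sizes are hypothetical.

```python
def core_edge_port_budget(core_ports, edge_switches, ports_per_edge, isls_per_edge):
    """Count the ports left for hosts and storage in a core-to-edge fabric.

    Assumes each ISL uses one port on the edge switch and one on the core.
    """
    if not 0 < isls_per_edge < ports_per_edge:
        raise ValueError("each edge switch needs at least one ISL and one free port")
    core_ports_for_isls = edge_switches * isls_per_edge
    if core_ports_for_isls > core_ports:
        raise ValueError("the core director has too few ports for the requested ISLs")
    return {
        "edge_device_ports": edge_switches * (ports_per_edge - isls_per_edge),
        "core_device_ports": core_ports - core_ports_for_isls,
        "ports_used_by_isls": 2 * core_ports_for_isls,
    }


# Example: one 64-port director with four 16-port edge switches, two ISLs each.
budget = core_edge_port_budget(core_ports=64, edge_switches=4,
                               ports_per_edge=16, isls_per_edge=2)
```

In this example 16 of the 128 purchased ports carry only inter-switch traffic, leaving 112 for hosts and storage.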
Interoperability between elements of a SAN fabric needs careful investigation and it may be prudent to stick with just one vendor for the provision of switches.
Figure 2 A schematic illustrating how no single point of failure can be achieved through dual-pathing to each SAN component using separate fabrics ‘A’ and ‘B’. Even more resilience may be achieved through the use of two nodes located in separate sites. In reality, storage arrays and tape libraries would be connected with more than just one fibre connection to each switch.
To maximise the benefits of a SAN, ideally dual redundant fabrics should be used, meaning that each server has two HBAs, each attached to different Fibre Channel switches that then have separate connectivity to the storage arrays. With suitable additional software installed, such a dual-hosted environment can also be used to provide dual-pathing with automated path failover or even path load balancing (see Figure 2). Figure 2 also shows how an even greater level of resilience can be achieved by replicating the SAN infrastructure over two separate sites.
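The path failover and load balancing behaviour described above can be illustrated with a toy model. This is a minimal sketch, not any vendor's multipathing driver: the class names are hypothetical, and a simple round-robin policy stands in for whatever policy a real product uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Path:
    """One route from a host HBA, through one fabric, to the storage array."""
    name: str
    healthy: bool = True


class MultipathDevice:
    """Round-robin I/O over healthy paths, skipping any that have failed."""

    def __init__(self, paths: List[Path]) -> None:
        if not paths:
            raise ValueError("at least one path is required")
        self._paths = list(paths)
        self._next = 0

    def select_path(self) -> Path:
        count = len(self._paths)
        for offset in range(count):
            candidate = self._paths[(self._next + offset) % count]
            if candidate.healthy:
                # Start the next search just after the path chosen now.
                self._next = (self._next + offset + 1) % count
                return candidate
        raise RuntimeError("all paths to the storage have failed")
```

With one path through fabric ‘A’ and one through fabric ‘B’, I/O alternates between the two while both are healthy and continues over the survivor if either fails.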
Some SANs require additional specific software on the hosts connected to the SAN beyond the HBA driver itself. This may be needed to provide the resilience and management of the SAN and may be required even when only one HBA is installed in a host server. On some SAN systems this software can be expensive and an unexpected additional cost of purchasing an HBA.
In the switches a technique called zoning is used to partition access between devices that are allowed to communicate with each other. Zoning might also be used to create barriers between different operating systems environments e.g. between UNIX and PC systems or between corporate business systems and student teaching systems. Further control of access between SAN components is possible by LUN masking, usually implemented in the storage arrays (see below).
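Conceptually, a zone is a named set of ports, and two ports may communicate only if they share at least one zone. The sketch below models just that rule; the zone and port names are invented for illustration, and real switches identify members by port number or World Wide Name and are configured through vendor tools.

```python
from typing import Dict, Set


def can_communicate(zones: Dict[str, Set[str]], port_a: str, port_b: str) -> bool:
    """Return True if both ports are members of at least one common zone."""
    return any(port_a in members and port_b in members
               for members in zones.values())


# Hypothetical zoning that separates UNIX and PC environments.
zones = {
    "unix_zone": {"unix_host_hba", "array_port_1"},
    "pc_zone": {"pc_host_hba", "array_port_2"},
}
```

Here `can_communicate(zones, "pc_host_hba", "array_port_1")` is false, so the PC host cannot reach storage presented on the UNIX array port.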
Fibre Channel fabrics now generally run at 2Gbps, with 1Gbps ports also still available. Most 2Gbps products can switch to 1Gbps, thereby still preserving investments in the slower technology. A 2Gbps link translates to 200MBps transfer rates and Fibre Channel can also offer full-duplex mode. However, some connection slots for the HBAs cannot sustain these throughput rates in full-duplex mode. PCI 64-bit cards at 66MHz or PCI-X slots (133MHz) are the best choice to ensure high end-to-end transfer rates to fully utilise the potential of Fibre Channel. Very recently, 4Gbps Fibre Channel has been announced, but many Fibre Channel proponents believe that 10Gbps should be the next leap in performance to match 10Gbit Ethernet. However, 10Gbps Fibre Channel is not intended to be backwards compatible with previous slower standards.
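The relationship between link speed and slot bandwidth can be checked with some rough arithmetic. The sketch below makes two assumptions that are not stated in the text: that each data byte occupies 10 bits on the Fibre Channel link (8b/10b encoding), and that a parallel bus peaks at its width in bytes multiplied by its clock rate. Real sustained rates will be lower than these theoretical peaks, and the 32-bit/33MHz slot is included only as an assumed example of an older slot.

```python
def fc_payload_mb_per_s(nominal_gbps: float) -> float:
    """Approximate one-way payload rate, assuming 10 line bits per data byte."""
    return nominal_gbps * 1000 / 10


def bus_peak_mb_per_s(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak of a parallel bus: bytes per transfer x million transfers/s."""
    return width_bits / 8 * clock_mhz


def bus_can_sustain(width_bits: int, clock_mhz: float,
                    nominal_gbps: float, full_duplex: bool = True) -> bool:
    """Compare the slot's theoretical peak with the link's payload rate."""
    needed = fc_payload_mb_per_s(nominal_gbps) * (2 if full_duplex else 1)
    return bus_peak_mb_per_s(width_bits, clock_mhz) >= needed
```

On these assumptions a 2Gbps link needs about 400MBps in full-duplex mode, which a 64-bit/66MHz slot (528MBps peak) can cover but a 32-bit/33MHz slot (132MBps peak) cannot.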
In fact, whilst SANs have been mainly based on Fibre Channel technology, new IP based options using more commodity-like components (e.g. Ethernet switches) are a possibility in the future. In particular, the standard for iSCSI (Internet SCSI) was agreed during 2003 and many products supporting iSCSI [6] are now appearing on the market. Servers for iSCSI either require an iSCSI HBA (which typically incorporates a TCP Offload Engine) or a standard Ethernet Network Interface Card (NIC) with a special software iSCSI driver on the host server, with the former very much preferred. Storage for iSCSI needs either a native iSCSI interface or (perversely) can be Fibre Channel storage with an iSCSI to Fibre Channel gateway device.
iSCSI as a replacement for Fibre Channel based SANs is not going to be realistic until 2005 and, whilst opinions vary on this topic, iSCSI may complement Fibre Channel, which might remain in the data centre to support enterprise systems. However, 10Gbit Ethernet is already available and there is a more aggressive roadmap for future Ethernet standards than Fibre Channel. These factors may reduce the theoretical advantages of a Fibre Channel fabric for the transport of storage data.
1.3 Disk Technologies
Enterprise class storage arrays used in SANs generally use Fibre Channel interfaces to internal FC-AL connections in the storage array with FC disks attached. More modestly priced storage arrays are available with an internal connection to either SCSI or ATA disks. Fibre Channel disks are designed for enterprise-class use and usually have top-end performance and reliability characteristics and thus attract premium prices.
However, in recognition of the fact that not all data needs to be treated equally (see the section Data Categorisation Strategy below), many SAN vendors now offer the option of storage arrays based on Serial ATA (SATA) disk technology [7]. Serial ATA disks are an evolution of the commodity parallel ATA (or IDE) disks used in PCs with a design intention of being at ATA-like price-points with SCSI-like performance. Such disks may not be suitable for mission critical enterprise database applications, but may have a role for less critical or low usage data. Indeed, some storage vendors now offer storage arrays that can accommodate various varieties of disk in the same cabinet e.g. FC and SATA. In such examples, the interface to the disk tray from the hosts/fabric is still Fibre Channel.
Low priced RAID arrays based on traditional ATA/IDE disks that can be incorporated into SANs have also been available for some while, but are now likely to be displaced by these Serial ATA arrays for the lower end of the storage array market. Another new technology for disk drives is Serial Attached SCSI (SAS) [8] that, like SATA, continues the trend away from parallel methods of data transmission to serial methods (with simplified cabling and more flexible connectivity).
The threat of widespread adoption of Serial ATA and Serial SCSI disks should also have a beneficial knock-on effect on the prices of top-end Fibre Channel disks.
1.4 Storage Arrays
Storage Arrays present a view to the host servers called a Logical Unit, identified by a Logical Unit Number (LUN), that appears to the host as a disk volume. A LUN is in itself a level of virtualisation as it is usually associated with a particular RAID level and thus formed from parts of several disks. Storage arrays typically implement several levels of RAID, with levels 0, 1, 3 and 5 very prevalent, with other combined levels (such as 0+1) also possible.
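The RAID level chosen for a LUN determines how much of the raw disk space is usable. The sketch below gives the textbook figures for the levels mentioned above; it is an approximation that ignores hot spares, metadata and any vendor-specific overhead, and the function name is hypothetical.

```python
def usable_capacity_gb(raid_level: str, disks: int, disk_gb: float) -> float:
    """Textbook usable capacity for a group of identical disks."""
    if raid_level == "0":
        if disks < 2:
            raise ValueError("RAID 0 needs at least two disks")
        return disks * disk_gb              # striping only, no redundancy
    if raid_level == "1":
        if disks != 2:
            raise ValueError("this sketch treats RAID 1 as a two-disk mirror")
        return disk_gb                      # one disk's worth, mirrored
    if raid_level in ("3", "5"):
        if disks < 3:
            raise ValueError("RAID 3 and RAID 5 need at least three disks")
        return (disks - 1) * disk_gb        # one disk's worth holds parity
    if raid_level == "0+1":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 0+1 needs an even number of disks, four or more")
        return disks // 2 * disk_gb         # half the disks mirror the other half
    raise ValueError(f"unsupported RAID level: {raid_level}")
```

For instance, five disks of 146GB give 584GB usable at RAID 5, whereas four of the same disks at RAID 0+1 give 292GB.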
The degree of control over the placement and use of actual disks varies when defining a LUN – some storage arrays offer a full level of virtualisation where storage administrators merely request a LUN of a given size, which is then created internally within the array and spread over many disks as determined by the array’s in-built virtualisation. In many other arrays, however, there is full control over the placement of LUNs across the actual disks in the array and the array may need its storage to be partitioned into separate “RAID Groups”, with a particular RAID level associated with the group. In such cases, administrators must carefully bear in mind the potential needs for expansion when designing their LUN structure (and associated RAID types) to ensure extra disk space may be added and allocated to the LUN. Otherwise, expansion may sometimes need to be achieved by defining a new, larger LUN and copying all the data across.
Storage arrays typically also include caches to improve read and write performance by acting as a buffer between the storage array and the server requesting the I/O operation. Caching is particularly beneficial for RAID types that require writing to multiple physical disks. However, events such as power failures or failure of the array’s storage processor do require very careful attention and techniques such as battery backup and writing of the cache to disk when these events occur should be used.
Storage controllers in the array control the data flows between the array’s Fibre Channel connecting ports and the actual disk modules constituting a LUN. The storage controller will also monitor the basic “health” of the array and its disks. An enterprise-quality storage array will also typically have more than one storage controller, providing extra resilience and sometimes extra performance as well. If multiple controllers are sharing the I/O load, then there is an additional level of complexity of cache management to ensure coherency between the caches.
Storage arrays have varying numbers of external Fibre Channel interfaces to connect the disks to host servers via the fabric. Although 2Gbps fabric equates to 200MBps transfer rates, the total aggregate sustainable throughput into and out of the storage array needs careful consideration for the workload patterns to be supported. The number of internal loops inside the storage array and numbers of disks attached to these loops should also be considered when assessing the suitability of SANs for really I/O intensive work.
The distribution of paths from LUNs through the storage controller(s), external Fibre Channel interfaces and the fabric may need careful consideration if load balancing software is not being used in order to ensure even distribution of I/Os through the SAN.
Disks in a storage array are usually hot-pluggable so that service to users is not disrupted when disks are added or removed. The ability to allocate some disks as hot spares is also usually supported. Use of a hot spare means that in the event of a disk failure, the storage array will automatically begin to rebuild the failed drive’s data onto the hot spare disk, continuing service if suitable RAID levels are in use. When the failed drive is then replaced, the storage array will usually rebuild onto the newly replaced disk, leaving the original hot spare available to handle any subsequent failure.
Storage arrays possess varying levels of “intelligence” that depend on the storage controller(s) within the array and on the software products installed in the controllers. An example of such capability is LUN masking, which is used to determine which host servers can have access to each LUN. This prevents unauthorised access to data from other servers or from server operating systems (e.g. Windows Server versions prior to 2003) that search round for available storage on booting. Although LUN masking is often implemented in the controller within the storage arrays, it is also often a feature of storage virtualisation software (see the Storage Virtualisation section below).
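LUN masking amounts to an access list held per LUN. The following sketch models that idea only; the class name and host names are hypothetical, and real arrays identify hosts by the World Wide Names of their HBAs and expose masking through vendor management tools.

```python
from typing import Dict, List, Set


class LunMaskingTable:
    """Records which hosts are allowed to see each LUN."""

    def __init__(self) -> None:
        self._allowed: Dict[int, Set[str]] = {}

    def grant(self, lun: int, host: str) -> None:
        self._allowed.setdefault(lun, set()).add(host)

    def revoke(self, lun: int, host: str) -> None:
        self._allowed.get(lun, set()).discard(host)

    def visible_luns(self, host: str) -> List[int]:
        """LUNs presented to this host; anything not granted stays hidden."""
        return sorted(lun for lun, hosts in self._allowed.items() if host in hosts)
```

A host that scans for storage at boot time would then be shown only the LUNs returned by `visible_luns` for that host.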
The storage array controllers will also typically be able to provide other enhanced facilities such as:
- Snapshots – rapid point-in-time copy of a LUN with only changes recorded; may be attached to another server for analysis or to be backed up; unchanged blocks refer back to the original source LUN
- Clones – full point-in-time copy of a LUN that can be used as a true copy of production data; uses same amount of disk space as source LUN and may take time to produce the clone, depending on its size
- Replication/Mirroring – the ability to replicate data to another storage array, either synchronously or asynchronously; may be used for DR purposes and particularly useful when the two storage arrays are at physically separate locations
Please note that vendor terminologies for the above may vary, and that these capabilities are usually provided by additional software options for the storage array’s controller and may well be additional cost items, sometimes attracting premium pricing. Storage Virtualisation software if used (see the Storage Virtualisation section below) can also provide this enhanced functionality.
The enhanced facilities above are those that distinguish SAN solutions from Direct Attached Storage solutions and are the basis of the additional flexibility, improved resilience and enhanced disaster recovery capability that will underpin the business case for a SAN. NAS devices can also incorporate some of these enhanced facilities such as snapshots, but are not generally designed for replication to other devices.
When using these enhanced features of storage arrays to fully exploit the potential of SANs, ensure that any limits, both inbuilt/technical and through licensing, are known in advance when configurations are being planned. There may be limits on the number of snapshots allowed or number of LUNs that may be mirrored etc.
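The snapshot behaviour listed earlier in this section (only changes recorded, with unchanged blocks referring back to the source LUN) can be illustrated with a copy-on-write model. This is a conceptual sketch with hypothetical class names; vendors implement snapshots in differing ways, and this is not a description of any particular array.

```python
class Lun:
    """A source LUN modelled as a list of blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self._snapshots = []

    def snapshot(self):
        """Create a point-in-time view; no data is copied at this moment."""
        snap = Snapshot(self)
        self._snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Copy-on-write: save the old block for each snapshot before overwriting.
        for snap in self._snapshots:
            snap.preserve(index, self.blocks[index])
        self.blocks[index] = data


class Snapshot:
    """Holds only the blocks that have changed since the snapshot was taken."""

    def __init__(self, source):
        self._source = source
        self._saved = {}

    def preserve(self, index, old_data):
        # Keep the first saved version only; it is the point-in-time content.
        self._saved.setdefault(index, old_data)

    def read(self, index):
        if index in self._saved:
            return self._saved[index]
        return self._source.blocks[index]   # unchanged: refer back to the source

    def blocks_used(self):
        return len(self._saved)
```

The space a snapshot consumes grows with the amount of change on the source LUN, which is one reason limits on the number of snapshots matter when planning a configuration.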
2.0 Centralised Backup Systems
Ideally a SAN should be linked to a Centralised Backup System (CBS) to provide operational efficiencies in backup/restore operations and eliminate the plethora of disparate backup systems typically found in an HE or FE institution.
These different backup systems typically arise when PC and UNIX support staff pursue their own backup utilities or use those supplied by the operating system vendor. Similarly, different schemes may be used to address differing business/academic requirements.
Current backup systems are diverse, complicated and not easy to manage. Most SANs will be purchased with an associated tape library capable of reading/writing to tape media in several tape drives with robotic control of a large number of tapes held in slots in the library. Enterprise tape libraries will typically have features such as bar code readers to identify the tapes and be capable of exporting/importing tapes that need to be taken off site/brought back onsite to and from fireproof safes. Not all tape libraries have native Fibre Channel interfaces, so it may be necessary to attach them to the SAN via a Fibre Channel to SCSI bridge device.
Various types of tape media are available for use in tape libraries. The DLT format has been the most popular for several years with the LTO/Ultrium format rapidly gaining ground. In the future these two formats are expected to dominate with a roughly equal market share [9] and convincing roadmaps for development [10, 11].
Both these tape formats have already gone through stages of evolution with different generations of tapes and drives available with continually improving capacity and performance characteristics as new versions are introduced. The latest generation of LTO (LTO 2), for example, has excellent throughput characteristics and a single server may not be capable of sustaining such a drive in its optimum streaming mode. Backup products that allow the inter-leaving of backups from multiple sources can assist with more efficient utilisation of modern tape devices. Sites should, however, consider the impact on restore times of highly inter-leaved backups.
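As a rough illustration of why inter-leaving helps, the number of concurrent backup streams needed to keep a drive streaming can be estimated from the drive's throughput and the rate each source can deliver. Both figures in the example are made-up illustrative values, not measurements of any particular drive or server, and the function name is hypothetical.

```python
import math


def streams_needed(drive_mb_per_s: float, per_source_mb_per_s: float) -> int:
    """Minimum number of inter-leaved sources needed to match the drive's rate."""
    if drive_mb_per_s <= 0 or per_source_mb_per_s <= 0:
        raise ValueError("rates must be positive")
    return math.ceil(drive_mb_per_s / per_source_mb_per_s)


# Assumed example: a drive that streams at 30 MB/s, fed by sources at 8 MB/s each.
example_streams = streams_needed(30.0, 8.0)
```

The trade-off noted above still applies: the more sources are inter-leaved on a tape, the more unrelated data must be read past when restoring any one of them.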
A SAN based backup solution allows backup of data to be consolidated into one system architecture. With appropriate options purchased with the backup software product, backups may optionally be driven across the SAN, thus reducing the bandwidth overhead on the campus network - the so-called “LAN free” backup mode [12]. LAN free backups are attractive when network traffic levels on the LAN are an issue, but sites should note that LAN free options in backup products are usually additional cost items, often with a premium price attached.
A further refinement is the concept of server-free backups [12], where data transfer occurs directly between the storage array and the tape library, although server-free backup products are not yet mature and proven.
Enhanced facilities offered by the SAN (such as snapshots) can also be used to reduce the impact of backup activities on production systems. A near-instantaneous snapshot may be taken and the newly created snapshot LUN then attached to the backup server to carry out the actual data backup, reducing the amount of time databases, for example, need to be offline or in hot backup mode.
Increasingly, with the availability of cheaper disks (e.g. SATA) in SAN storage configurations, backup vendors are also providing options for disk-to-disk backup. In this scenario, data can be copied to disk in real-time over the SAN and then backed up to tape off-line, e.g. during the day. This greatly extends the ‘backup window’.
Deployment of a SAN would allow for consolidation (with matching cost saving) on backup infrastructure over its life-cycle along with increased productivity of systems and support personnel.
3.0 Strategic Fit and Industry Positioning
The take up of NAS and SAN solutions is rising and a NAS or SAN solution is cheaper to run than DAS. In the generic business sector, total cost of ownership has been found to be 55-60% lower than for an equivalent amount of DAS storage. The industry as a whole reports an average support cost reduction of 80% (based on FTEs per MB storage) compared with supporting the equivalent DAS infrastructure. Further cost savings are seen following backup consolidation (typically 50-75% in tape drive consolidation) [13].
The benefits of using SAN and NAS technologies to consolidate storage are compelling [14]. The Butler Group believes that storage consolidation should be a primary objective for an organisation looking to optimise its IT infrastructure [14].
Fibre Channel SANs and IP-attached NAS are now established technologies. The usability of management tools is rapidly improving as they provide greater automation and become available for more platforms. In most cases, the savings and improvements in staff productivity, utilisation rates and data availability more than justify the additional cost of installing SANs.
The future will lead to more interoperability and the adoption of open standards throughout the industry. New developments will see ‘intelligence’ being combined with storage. For example, an application should be able to tell the storage system that it needs more storage and then be assigned that additional resource automatically.
Major operating systems vendors are also acknowledging the greater uptake of SAN technologies. For example, Microsoft’s Windows Server 2003 operating system has new features to enable SAN support [15]:
- Virtual Disk Service (VDS)
- Volume Shadow Copy Service (VSS)
- Multipath Input/Output (MPIO)
- Internet SCSI (iSCSI) support
- Ability to boot from a SAN
- Controlled volume mounting at boot time
Storage vendors are producing “plug-ins” for the Windows Server 2003 features above e.g. for VDS, VSS and MPIO – this trend for more SAN awareness in operating systems will further aid the manageability of SANs.
4.0 Data Growth Rates and Their Management
The explosive growth of the Internet, email (with attachments), integrated enterprise business suites and greater use of digital media in personal devices (e.g. cameras) is creating unprecedented demand to store, retrieve and communicate information. In fact, the world’s population is expected to create more information in the next three years than in all the years of prior existence [9].
Generic demand for storage across all business areas is growing. Storage growth estimates show a 76% increase in demand for storage per year across all data types. Big growth areas include e-mail (100-300% growth per year), data warehousing (72-115% per year) and internet content (75%). Customer Relationship Management (CRM) systems are also requiring more storage (growth 47% per year) [13].
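Annual growth rates of this size compound quickly. The sketch below projects demand from a starting capacity and a constant yearly growth rate; treating the 76% figure as constant over several years is an assumption made purely for illustration, and the function names are hypothetical.

```python
import math


def projected_storage(current_tb: float, annual_growth: float, years: int) -> float:
    """Capacity needed after compounding a constant yearly growth rate."""
    return current_tb * (1 + annual_growth) ** years


def doubling_time_years(annual_growth: float) -> float:
    """Years for demand to double at a constant yearly growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


# 1 TB today at 76% growth per year, projected three years ahead.
three_year_demand_tb = projected_storage(1.0, 0.76, 3)
```

At 76% per year, 1TB of demand becomes roughly 5.5TB after three years, and demand doubles in a little over fourteen months.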
Demand for storage in the HE/FE sector is growing in line with other business sectors. Growth is predicted within the e-mail and internet content data types and also newer functionality such as data warehousing and digital media storage.
Storage may be becoming cheaper in terms of cost per megabyte, but high data growth rates and the cost of management and backup of all this data are becoming prohibitive.
Fundamentally over recent years, the cost of storage has decreased in terms of the capital cost per megabyte of storage. However, the total lifetime cost of storage including its management, backup and maintenance should be considered. Many different industry analysts quote prices per megabyte of storage and varying factors of that cost per megabyte to manage it. A conservative estimate is a factor of three for costs of management of storage over its lifetime versus initial purchase costs.
Industry analysts also publish varying figures for the different costs of managing DAS, NAS and SAN storage. However, the essential point is not the absolute value of any analyst’s figures for these architectures, but the practical experience that many more gigabytes, even terabytes, of storage can be maintained by a given amount of staff resource for a SAN than in a DAS scenario. These economies of scale are even more apparent when a Centralised Backup System (CBS) is an integral part of the SAN landscape.
Management of escalating amounts of storage is indeed one of the chief challenges facing IT support organisations in all sectors.
5.0 Storage Management
Storage management can encompass several layers: management of the individual devices constituting the SAN (SAN Management), management of them as a virtual resource pool (Storage Virtualisation) and management/reporting of the data characteristics and growth patterns (Storage Resource Management).
5.1 SAN Management
SAN Management software is needed to actually configure and monitor the components of the SAN to enable them all to function together. It is directly concerned with enabling and controlling the movement of data within the SAN infrastructure.
SAN Management products are typically able to:
- Discover devices attached to the SAN – hosts, storage devices, switches and other fabric components
- Manage and monitor ports on the Fibre Channel switches
- Administer zoning on the switches to selectively enable access
- Administer LUN masking in the storage arrays to partition access to particular servers
- Monitor traffic levels and performance between components and through the switches
- Manage configuration changes within the SAN
5.2 Storage Virtualisation
Virtualisation is an overused term in computing and in the specific area of storage, there is also much scope for confusion over the use of the term “storage virtualisation”. Some storage arrays, for example, have in-built virtualisation features whereby the location of data and the disposition of storage LUNs are hidden.
Storage Virtualisation for the purposes of this report is an additional (optional) layer of storage management that can provide a centrally managed pool of storage with virtual volumes being made available to servers, as illustrated schematically in Figure 3 below. Such virtualisation solutions have the additional merit of operating in heterogeneous SAN environments, consolidating the storage devices from several vendors.
Such virtualisation solutions [16, 17] fall into two camps: (a) in-band or symmetric and (b) out-of-band or asymmetric solutions. For in-band solutions, all control functions, metadata and data pass through the storage virtualisation server or appliance. For out-of-band solutions, only control data and metadata (data about the data) pass through the storage virtualisation server or appliance, with raw data flows being direct between host servers and the storage arrays. Unlike in-band products, out-of-band solutions require the installation of an agent on each host server. The agent communicates with the storage virtualisation server for volume information and control, while actual read/write data transfers go directly to the virtual volume’s storage. (Some in-band solutions may also require software on the host anyway.)
Storage Virtualisation products are typically able to:
- Allocate virtual volumes of a given size with no knowledge needed of the underlying hardware architecture
- Enable non-disruptive dynamic expansion of virtual volumes (subject to the host operating system being able to cope with this)
- Operate in a heterogeneous environment with servers and storage devices from various vendors
- Manage the multi-pathing between servers and storage devices
- Manage the data mirroring capability (potentially across heterogeneous environments)
- Support LAN-free (and sometimes Server-free) backup schemes
- Support snapshots for more efficient backups or point-in-time restore capability
- Implement LUN masking to partition access to particular servers
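The first two and the last of the capabilities listed above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class names, array names and host names are all hypothetical, and no real virtualisation product exposes this interface.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualVolume:
    name: str
    size_gb: int
    allowed_hosts: set = field(default_factory=set)  # LUN masking list


class StoragePool:
    """Toy pool that aggregates capacity from several (hypothetical) arrays."""

    def __init__(self, arrays_gb):
        # arrays_gb maps an array name to its usable capacity in GB
        self.capacity_gb = sum(arrays_gb.values())
        self.volumes = {}

    def free_gb(self):
        return self.capacity_gb - sum(v.size_gb for v in self.volumes.values())

    def allocate(self, name, size_gb, hosts):
        # The caller asks for a size only; which array holds the data is hidden.
        if size_gb > self.free_gb():
            raise ValueError("pool exhausted")
        self.volumes[name] = VirtualVolume(name, size_gb, set(hosts))
        return self.volumes[name]

    def expand(self, name, extra_gb):
        # Growing a volume only changes pool accounting here; the host
        # operating system must still be able to cope with the larger volume.
        if extra_gb > self.free_gb():
            raise ValueError("pool exhausted")
        self.volumes[name].size_gb += extra_gb

    def can_access(self, host, name):
        # LUN masking: only hosts on the volume's list may see it.
        return host in self.volumes[name].allowed_hosts


pool = StoragePool({"array_vendor_a": 2000, "array_vendor_b": 1000})
pool.allocate("mail_store", 500, hosts={"mailsrv1", "mailsrv2"})
pool.expand("mail_store", 250)
print(pool.free_gb())                            # 2250
print(pool.can_access("websrv1", "mail_store"))  # False
```

The point of the sketch is that servers deal only with volume names and sizes, while the pool decides where the capacity comes from.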
5.3 Storage Resource Management
Storage Resource Management (SRM) is a higher (optional) layer of management that doesn’t control the flow of data within the SAN, but rather analyses and monitors the patterns of data access and usage, possibly including charge back features. Some basic SAN management software is essential for the functioning and continued well-being of a SAN; SRM software is a desirable, but not essential component of a consolidated storage environment.
One of the key drivers for a SAN is the need to consolidate storage (and servers) and this task is considerably aided by knowledge of how existing storage is being utilised, the patterns of data access and how much data is changing etc. For example, data may be replicated several times or old data stored without being deleted, both adding to the cost of storage without providing any added value to the organisation. SRM software can provide reporting features to better maintain existing storage and to more effectively plan future storage architectures.
SRM products are typically able to:
- Provide an enterprise-wide inventory of storage (including storage outside of the SAN)
- Identify unused files (not accessed for a long time or “orphans” with no current valid data owner)
- Check that modified files are backed up
- Monitor overall growth trends to enable future storage needs to be planned
- Identify the specific types of data being stored and their storage growth trends
- Provide utilisation reports by person, department, filesystem, server etc
- Provide charge back mechanisms or reports that can be input into separate charge back systems
- Provide web-browser interfaces and be portable to UNIX, Linux and Windows environments
- Enable implementation of storage policies for different categories of data
- Support heterogeneous platform and storage environments
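Two of the reports listed above, utilisation by department and identification of files not accessed for a long time, can be approximated with a short script. The following is a minimal sketch, not a substitute for an SRM product: the function name `scan`, the one-year threshold and the use of top-level directories as a stand-in for departments are all assumptions made for illustration.

```python
import os
import tempfile
import time
from collections import defaultdict


def scan(root, stale_days=365, now=None):
    """Return (bytes used per top-level directory, list of stale file paths)."""
    now = time.time() if now is None else now
    cutoff = now - stale_days * 86400
    usage = defaultdict(int)
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            usage[top] += info.st_size
            # Last access time is only a hint: some filesystems are mounted
            # with options (e.g. noatime) that stop it being updated.
            if info.st_atime < cutoff:
                stale.append(path)
    return dict(usage), stale


# Demonstration on a throwaway directory tree with made-up contents.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "physics"))
    os.makedirs(os.path.join(root, "history"))
    old = os.path.join(root, "physics", "run1.dat")
    new = os.path.join(root, "history", "notes.txt")
    for path, size in ((old, 2048), (new, 512)):
        with open(path, "wb") as fh:
            fh.write(b"\0" * size)
    two_years_ago = time.time() - 2 * 365 * 86400
    os.utime(old, (two_years_ago, two_years_ago))  # pretend it is old
    usage, stale = scan(root)
    stale_names = [os.path.basename(p) for p in stale]

print(usage)        # {'physics': 2048, 'history': 512} (key order may vary)
print(stale_names)  # ['run1.dat']
```

A real SRM tool would also attribute files to owners, track growth over time and cover storage outside the local filesystem.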
5.4 SMIS/Bluefin Storage Management Initiatives
SAN management tools have historically been very specific to each vendor’s storage components with poor interoperability across heterogeneous environments. Storage vendors are now beginning to produce products that do operate with a wider variety of storage hardware. Also, Storage Virtualisation products (see above) that can hide an underlying miscellany of diverse equipment types may in their own right also provide overall SAN management functions.
In recognition of poor interoperability potentially impeding the uptake of SANs, the Storage Networking Industry Association (SNIA) [18] has sponsored standards for multi-vendor interoperability. One such standard is the Storage Management Initiative Specification (SMIS) [19], formerly known as “Bluefin”. SMIS-compliant products are evolving, but by 2005 they are only expected to offer rudimentary SAN management functions such as device discovery, configuration and error management. More sophisticated functionality such as snapshot management is not expected in SMIS-compliant products before 2006-7. Consequently, inclusion of standards-based support cannot yet be a decisive factor in choosing a SAN supplier.
6.0 Data Categorisation Strategy
Not all information needs to be treated the same, as the value and importance of information varies. Information needs to be categorised into different levels of criticality that determine the appropriate means of handling the data throughout its lifetime. As the usefulness of data varies over time, there is a need to consider Information Lifecycle Management (ILM) so that data may be stored on different types of media as it ages and becomes less business critical. For example, data may initially reside on enterprise-class Fibre Channel disks in a storage array and later be migrated to cheaper Serial ATA disks or to nearline tape storage. Eventually, after a further period of time, the data might reside only on offline tape storage.
In normal circumstances each category would have appropriate availability expectations, supported by differing infrastructure architectures e.g. use of clustering or mirrored data on the SAN etc.
In the case of an incident, the time to recover and the point to which recovery must be made need to be considered for each category of data [20]. These are the Recovery Time Objective (RTO) and Recovery Point Objective (RPO), respectively. The RTO is the target time within which a system must be recovered, and the RPO is the target point in time to which data must be recovered (dependent on how much data/transaction loss can be tolerated).
Varying levels of criticality are used, with typical definitions being similar to the following:
- High criticality information is that which is essential for the organisation to operate effectively and where the absence of it, or inaccuracy in it, at a time it is required, would have catastrophic results on the functioning of the organisation or its reputation.
- Medium criticality information is that which would take considerable effort to recreate if it became unavailable for a prolonged period of time, or corrupted, and which would cause significant disruption to the functioning of the organisation through its absence or inaccuracy.
- Low criticality information is that which can be recreated easily and the unavailability or inaccuracy of it would be nothing more than inconvenient to the organisation.
Some sites may also have another category:
- Mission critical/continuously available information is that which must be available continuously, where no downtime can be tolerated. Examples would be online e-commerce sites, air traffic control or critical health care systems.
Typical normal availability expectations and recovery objectives in the case of incidents might be:
Category | Normal Availability | RTO | RPO |
Mission critical | 100% | 0 | 0 |
High criticality | 99.99% | 0-12 hours | Minutes |
Medium criticality | 99% | 12-72 hours | Hours (up to 24 hours) |
Low criticality | 95% | > 72 hours | Days (1-7) |
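To make the availability figures in the table more concrete, they can be converted into the downtime each one permits per year. The sketch below assumes a 365-day year and that availability is measured over the whole year, around the clock; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def max_downtime_hours(availability_percent):
    """Hours per year a service may be down and still meet the target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for label, target in [("Mission critical", 100.0), ("High", 99.99),
                      ("Medium", 99.0), ("Low", 95.0)]:
    print(f"{label:17s} {target:6.2f}%  {max_downtime_hours(target):8.2f} h/year")
```

On these assumptions 99.99% allows a little under an hour of downtime per year, 99% allows about 88 hours, and 95% allows about 438 hours (roughly 18 days).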
7.0 Fit of a SAN into a Data Categorisation and DR Strategy
A SAN can support an institutional data storage strategy whereby data of a high criticality is mirrored and duplicated on the SAN (providing high availability and resilience). Medium criticality data may also be stored on the SAN, but might be mirrored asynchronously rather than in real time, or not mirrored at all. Low criticality data could be stored on the SAN or elsewhere, with no mirroring or duplication. All data types would be backed up, however, using the associated centralised backup system tape library (or libraries) attached to the SAN.
The SAN permits rapid server recovery in the event of a server hardware failure – replacement servers can be simply “pointed” to the appropriate disks on the SAN. There is no need to move data to the new hardware, so downtime could potentially be reduced to minutes. Multiple copies of data can be held on the SAN using “snapshot” technology. Snapshots of mission critical data can be taken at various times during the day without impacting the user. In the event of a corruption, a snapshot can then be presented to the host system in minutes. In addition, the increased fault-tolerance of a SAN would mean fewer out-of-hours data manipulations, therefore generating a saving in staff overtime.
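Why a snapshot can be taken quickly and without impacting the user is easier to see in a small model. The following sketch uses copy-on-write, which is one common way of tracking only changed blocks; it is an illustration with hypothetical class names, and real storage arrays may implement snapshots differently.

```python
class Volume:
    """A toy block device: a list of fixed-size blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        # Taking a snapshot copies no data, so it is effectively instant.
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Copy-on-write: before overwriting, preserve the old block in every
        # snapshot that has not already saved its own copy of that block.
        for snap in self.snapshots:
            snap.saved.setdefault(index, self.blocks[index])
        self.blocks[index] = data


class Snapshot:
    def __init__(self, source):
        self.source = source
        self.saved = {}  # block index -> contents at snapshot time

    def read(self, index):
        # Changed blocks come from the saved copy, unchanged ones from the source.
        if index in self.saved:
            return self.saved[index]
        return self.source.blocks[index]


vol = Volume(["a0", "b0", "c0"])
morning = vol.snapshot()
vol.write(1, "b1")
vol.write(1, "b2")
print([morning.read(i) for i in range(3)])  # ['a0', 'b0', 'c0']
print(vol.blocks)                           # ['a0', 'b2', 'c0']
print(len(morning.saved))                   # 1
```

Only one block has been copied even though the volume was written twice, which is why several snapshots a day can be kept at modest cost in disk space.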
Many organisations are moving towards greater use of server clustering to achieve higher availability and better recovery ability. In a clustered environment, the failure of any one server would not cause a service to be unavailable as other servers in the cluster would take over the work load. Clusters operating over a distance (often known as “stretched clusters”) also provide disaster recovery capability if the workloads can be switched to the other, still-functioning site in the event of total failure of a data centre, for example.
SANs with their shared storage naturally lend themselves to supporting server clustering. A combination of clustered servers accessing data on the SAN with its inherent reliability features provides a means of ensuring high availability. If sites have multiple machine rooms then separate storage nodes could be located sufficiently apart and data mirrored such that loss of one data centre would not impact the availability of high criticality data. Deploying a SAN across two nodes in two separate locations would provide for much improved disaster recovery capability, as illustrated by Figure 2.
8.0 E-Science/Grid Support
The national e-Science Grid Programme [21] and other Grid initiatives are aiming to develop distributed high-performance computational facilities for researchers to be further complemented by additional data intensive facilities. This combination of compute intensive and large capacity data storage is the basis for a “research infrastructure”.
SANs with their inherent reliability features and ease of scalability can naturally support this required data intensive infrastructure. Specialist servers/high-end workstations might be directly attached to the SAN to support these e-Science and other research initiatives.
This research infrastructure could be enhanced by possible extras such as a Hierarchical Storage Management (HSM) system or other archiving facilities that may not necessarily be included with any initial SAN configuration. Research data is typically characterised by large volume and often relative inactivity once analysed. It needs to be archived in the longer term as, typically, further analysis is run later as better systems and means of analysis develop.
9.0 Benefits of a SAN
9.1 Reduced hardware capital costs
Centralised storage means that new hardware procurement for centrally managed systems will require little or no local disk provision. For major corporate systems and large server environments, this affords a significant cost saving.
Furthermore, in those HE/FE establishments with large numbers of servers run by local departmental IT support personnel (outside of the central IT service), substantial further cost savings would be made through server consolidation (reducing the total number of servers needed) across the entire HE/FE establishment. This could be achieved by removing the need for departments to host, operate and manage their own servers for file or email provision, for example.
An additional cost saving would be afforded by being able to redeploy some staff effort currently assigned to managing these departmental servers.
A SAN would permit central IT providers to offer “managed storage” to users across the campus. Disk space could be made available on the SAN through central high-availability clustered servers. This would maximise the “economies of scale” benefit of deploying a SAN.
Surveys indicate that 50% of storage is typically unused in organisations (commercial and public sector) so that savings can be made by not purchasing this unnecessary storage across the institution as a whole.
9.2 Reduced effort to manage storage
The centralised, consolidated storage environment provided by a SAN (and particularly with an associated CBS) provides for easier management of data compared to a DAS environment. From figures given in the Merrill Lynch [13] or Butler [14] reports, nearly half the total lifetime cost of a DAS environment is the people cost, whilst for NAS and SAN environments people costs are only about an eighth of the lifetime costs of ownership.
This enhanced efficiency of management for networked storage environments (SAN and NAS) results in a greater volume of data being capable of management by a given individual. Industry surveys suggest that the “Terabytes per FTE” are between 3 and 4 times greater for networked storage compared to DAS.
9.3 Increased productivity through improved fault tolerance and DR capability
The enhanced disaster recovery capability of a SAN would lead to increased availability, giving the potential for significant staff (and student) productivity savings. It is very difficult to cost accurately the effect of server or system downtime; therefore some sites may not be able to include downtime reduction as a tangible benefit in their cost/benefit analysis. However, if actual cost savings cannot be accurately determined, it should still be highlighted as a major intangible benefit in the business case.
9.4 24x7 Availability
A SAN can provide 24x7 availability as there should be no single point of failure within a suitably designed SAN infrastructure.
Most HE/FE institutions need to accept a trend to widen their provision beyond the needs of the classic 18 to 21 year olds who enter higher education after A-levels (or after a gap year). National pressures for widened participation (from non-traditional student backgrounds), increased student numbers and distance learning all dictate a need for facilities to be available during non-standard parts of the day.
Traditional students also now have higher expectations for an “enhanced student experience” (partially through fees being payable), reflecting trends in society as a whole for greater expectations from “infrastructure” reliability.
This all leads to the need for increased availability of teaching services so that from the student point of view teaching and learning facilities are always available (i.e. 24x7).
9.5 More efficient backup
As the tape library in a typical Centralised Backup System may be used to back up data from several different servers, the speed of backups may be improved. This is a result of data from several sources potentially being interleaved on the tape, sustaining the tape drive at its maximum streaming operating efficiency. The utilisation of tapes may also be improved by this sharing of tapes between many sources, so that fewer tapes and fewer tape drives are needed to back up a given amount of data.
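The interleaving idea can be sketched in a few lines. The example below is a simplified illustration only: the server names, block labels and data rates are made up, and real backup software uses its own multiplexing formats.

```python
from itertools import zip_longest


def interleave(streams):
    """Yield (source, block) pairs round-robin from several backup streams."""
    names = list(streams)
    for row in zip_longest(*(streams[n] for n in names)):
        for name, block in zip(names, row):
            if block is not None:
                yield name, block


def streams_needed(drive_mb_s, client_mb_s):
    """Smallest number of equal clients whose combined rate feeds the drive."""
    return -(-drive_mb_s // client_mb_s)  # ceiling division on integers


tape = list(interleave({"srvA": ["A1", "A2", "A3"],
                        "srvB": ["B1"],
                        "srvC": ["C1", "C2"]}))
print(tape)
# [('srvA', 'A1'), ('srvB', 'B1'), ('srvC', 'C1'), ('srvA', 'A2'), ('srvC', 'C2'), ('srvA', 'A3')]
print(streams_needed(30, 8))  # 4
```

Each block is tagged with its source so that a restore can pick one server’s data back out of the mixed stream. The second function shows the sizing argument: if a drive streams at 30 MB/s and each server can only supply 8 MB/s (hypothetical figures), four servers must be backed up together to keep the drive streaming.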
A large tape library will essentially operate in unattended mode with the minimum of human intervention. A sufficiently large tape library will enable cycles of backup tapes to be kept in the library for longer periods e.g. a full week or even a month or longer, avoiding the need to physically fetch tapes from the fireproof safes for restores. Clone copies of tapes for safe keeping in fireproof safes should still be made, so some operational effort to remove tapes from the library for storage in a fireproof safe will still be needed. However, considerable operational efficiencies will be achieved and the CBS element alone of any SAN Business Case should produce substantial savings by rationalisation of backup/restore activities.
9.6 Scalable Storage
A SAN allows for “pay as you grow” storage scalability, pro-active storage planning and non-disruptive growth/reconfiguration. Extra disks can be purchased at any time and added to the total storage provision. Configuration utilities allow for that storage to be ‘presented’ to any host that requires space. Free space can also be re-assigned as necessary. Typical sites find that up to half their disk space is unused (not allocated) and in hardware terms alone (i.e. not including maintenance and administration) substantial savings in purchase costs could be made.
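The saving from pooling free space can be shown with simple arithmetic. This is a sketch with made-up figures: the per-server demands, the 73 GB disk size and the 20% growth margin are assumptions, and RAID overhead is ignored for both cases.

```python
import math


def das_purchase_gb(demands_gb, disk_gb):
    """Each server buys whole disks of its own, rounded up."""
    return sum(math.ceil(d / disk_gb) * disk_gb for d in demands_gb)


def pooled_purchase_gb(demands_gb, disk_gb, headroom=0.2):
    """One shared pool sized for total demand plus a growth margin."""
    needed = sum(demands_gb) * (1 + headroom)
    return math.ceil(needed / disk_gb) * disk_gb


demands = [40, 75, 10, 160, 25]  # GB actually needed per server (made up)
das = das_purchase_gb(demands, disk_gb=73)
san = pooled_purchase_gb(demands, disk_gb=73)
print(das, san)                         # 584 438
print(f"{1 - sum(demands) / das:.0%}")  # 47%
```

With these illustrative numbers nearly half of the directly attached capacity would sit unused, whereas the pooled purchase is smaller even after allowing for growth.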
9.7 Interoperability between diverse systems
Disparate computer platforms can share the SAN as their common mass storage system. The SAN would then provide a totally heterogeneous environment, enabling cross platform functionality (e.g. between UNIX, PC and other operating systems) for activities such as file and print serving, database hosting, high-performance computing needs etc. In fact, new developments allow systems to boot direct from a SAN; servers would then require no local storage whatsoever. This could then possibly allow “spare” servers to be kept that are able to be allocated to any of several operating systems in the event of a server failure.
9.8 Centralised Management
Centralised management will afford staff productivity gains as well as reduced total cost of ownership through intelligent storage management (only one dedicated storage team performing the work of many systems administrators), economies of scale (purchasing terabytes of disk space in one go rather than lots of small disk sub-systems), cheaper hardware (servers no longer need specifying with individual disks and expensive RAID controllers) and a centralised (and shared) backup and restore architecture. Centralised management will also facilitate common standards in storage management according to an institutional storage policy (e.g. standards in disk allocation, levels of fault-tolerance, scalability, utilisation and back up/recovery mechanisms across the institution). This should also assist sites with disaster recovery and security audits.
10.0 Justification for SANs – Writing the Business Case
Investing in a SAN infrastructure is a major undertaking and, whilst it is beyond the remit of this report to quote prices, it is certainly easy to incur expenditure in the region of £200k to £1M for an “institutional SAN”. Consequently, purchase of a large scale SAN (and associated Centralised Backup System) should be regarded as an institutional strategic decision with any Business Case containing many references to aims and objectives set out in the organisation’s institutional strategy documents.
Typical institutional strategic aims that might be referenced are in areas such as the following:
- Electronic storage of data and processes and procedures to ensure data backup and recovery (may be part of an institutional data storage strategy)
- Adherence to institution-wide standards
- Use of common integrated systems
- Central generic provision versus devolved local provision
- Support of lifetime learning, distance learning and widened participation
- Support of any time, any place, 24x7 access
The benefits of a SAN outlined in the above section should be linked to these institutional objectives to underpin a more robust information infrastructure for the institution.
The SAN Business Case therefore requires the support of senior business managers in the HE/FE institution and should not be seen as the latest technological development that interests just the central IT provider in the institution.
Initial purchase costs and ongoing maintenance costs are typically large and any benefits and return on the investment may be treated with scepticism initially in some quarters. However, the production of a properly costed business case showing the total costs and benefits [22, 23 and 24] over the extended lifetime (e.g. 5 years) of this equipment should show net benefits well within that period.
Factors to include in any cost/benefit analysis are:
- Reduction in numbers of servers
- Reduction in backup infrastructure complexity
- Staff productivity gains for backup administration
- Savings by not having large amounts of unallocated disk over the campus
- Reduction in systems and support staff out of hours working
- Reduced costs of server upgrades – little or no local disk required
- Possibility of offering centrally managed “commodity” storage leading to staff efficiency gains elsewhere on campus in a devolved environment
- Improved productivity by reduced downtime
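Once the factors above have been costed, the comparison over the equipment lifetime is straightforward arithmetic. The sketch below uses placeholder figures in £k that are not taken from any real procurement; it also ignores discounting and treats annual costs and benefits as constant, which a real business case may not.

```python
def cumulative_net_benefit(capital, annual_cost, annual_benefit, years=5):
    """Running total of benefits minus costs at the end of each year."""
    totals = []
    balance = -capital
    for _ in range(years):
        balance += annual_benefit - annual_cost
        totals.append(balance)
    return totals


def payback_year(totals):
    """First year (1-based) in which the running total is no longer negative."""
    for year, balance in enumerate(totals, start=1):
        if balance >= 0:
            return year
    return None


# Placeholder figures in £k: purchase, yearly maintenance, yearly savings.
totals = cumulative_net_benefit(capital=500, annual_cost=60, annual_benefit=220)
print(totals)                # [-340, -180, -20, 140, 300]
print(payback_year(totals))  # 4
```

Presenting the running total year by year makes it clear to senior managers when the investment is expected to break even within the five-year lifetime.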
11.0 Risks/Issues
The above two sections of this report have presented the positive aspects of the potential of a SAN (as needed to make the business case). However, procurement, installation and configuration of a SAN in 2003 are still highly complex and lengthy exercises with many unexpected interoperability problems.
The following risks/issues include some of the negative aspects of using SAN technology that sites must be fully aware of and these are based on the negative or unexpected experiences of HE/FE sites that have already procured and installed SANs! Some of these risks should also be included in the Business Case so that senior managers in the HE/FE institution have adequate information on which to base their decision on the merits of SANs (or not).
a) For a SAN to encompass a substantial proportion of the data storage needs of an HE/FE institution there needs to be a cultural change in the way storage is considered – data to be considered by type and criticality according to an institutional storage strategy, with fewer individuals having control over their own disk storage. A SAN is easier to “sell” if departments with their own server/storage solutions are prepared to accept more centrally delivered storage. Political considerations may require as much time as technical and financial issues.
b) An inadequately resourced or poorly designed SAN can itself be a single point of failure for a large volume of data. On the other hand, a properly resourced and well designed SAN can eliminate single points of failure and provide enhanced resilience, but this does require a high level of investment to achieve (e.g. with dual fabrics, dual HBAs, mirroring etc).
c) A properly resourced and well designed SAN also needs to take into account aggregate throughput needs of all the many servers attached, so as not to become a performance bottleneck.
d) SANs are relatively complicated to implement and require significant training for both server/storage support teams and operations support staff that carry out backups and monitoring etc.
e) Sites not having appropriate staff to run a SAN may wish to consider NAS as a solution to server/storage consolidation needs, particularly if the serving of file systems caters for a large proportion of storage needs. Buying many different smaller NAS devices should be avoided, however, as the advantages of storage consolidation and staffing efficiency gains will be lost with equipment from many different vendors. Alternatively, a managed SAN service provided by a storage provider or leasing of storage and/or services could be considered.
f) DAS still has a role as the cost of HBAs, dual pathing and requisite software for SAN environments can add a premium to the costs of connecting a server, making it unrealistic for small or completely self-contained functions.
g) Despite vendor marketing claims, there are still many interoperability issues between the various components of a SAN. It is recommended that institutions wishing to procure a SAN probe vendors carefully about such issues during tender negotiations. An example is compatibility between HBA, switch, storage array, host operating system and patch levels.
h) Pre-sales staff appear to be genuinely unaware of some of the unexpected limits or compatibility issues that arise when SAN implementations begin in earnest.
i) Similarly, pre-sales staff are not always aware of all software requirements for clients of a SAN, leading potentially to unexpected additional expense. This confusion can still remain after installation has begun! Software requirements for HBAs seem a particular area for confusion with many vendors.
j) Incompatibility issues mean that vendors often want very exact details of operating system levels and patch levels of host servers together with firmware levels of HBAs, switches and storage controllers, and need to be informed of any configuration changes in order to ensure that they will offer support contracts. This requires more disciplined management of the SAN than HE/FE sites have traditionally been used to, and adds a level of formality to the interface with the supplier’s maintenance division.
k) As a result of the above, some sites set up dedicated “Storage Teams” within their IT support organisations, whilst others handle the new workloads amongst existing server support teams. This issue needs to be considered at each site with possible structural and process changes introduced [25].
l) Many features that truly exploit the potential of a SAN (such as LAN-free backups, the ability to take snapshots or be able to do remote mirroring etc) will be extra cost items and vendors should be probed carefully to ascertain what is included in the base cost of a SAN. Licences based on total size of storage or with bandings related to storage size should be studied carefully, particularly when future growth is considered. Similarly, check that licences don’t have restrictions on numbers of hosts to be attached; again bearing in mind future growth in numbers of SAN attached servers.
m) The whole question of whether to use some storage virtualisation product as another layer between the host servers and the storage will require much thought and debate. Storage virtualisation products support heterogeneous environments and so prevent lock-in with any given storage vendor and even allow existing storage to be used. On the other hand, it does mean effectively a lock-in with the storage virtualisation vendor to achieve this freedom to choose hardware at will! If virtualisation is chosen, there is also the in-band versus out-of-band debate to be had as well.
12.0 Glossary
Abbreviation | Term | Explanation |
DAS | Direct Attached Storage | Storage locally attached to a particular server. |
SAN | Storage Area Network | A dedicated network of shared storage devices. |
NAS | Network Attached Storage | Storage available from a device resident on the LAN and shared via file protocols such as NFS or CIFS. |
LUN | Logical Unit (or Logical Unit Number) | A logical presentation of disk space created from individual disks, groups of disks or parts of multiple disks as defined by a RAID controller. |
FC | Fibre Channel | A high speed, serial data interface allowing communication between servers, data storage devices and other communications devices such as switches. |
 | Switch | A device to connect hosts and storage devices, capable of supporting multiple dedicated non-shared data paths between devices. |
 | Zoning | A technique to partition FC fabrics to prevent unrestricted access between hosts and devices. |
 | LUN Masking | A technique to associate particular LUNs with particular host servers. |
 | Fabric | The switches and connection infrastructure providing the “network” paths between hosts and storage devices. |
ISL | Inter-Switch Link | A link between FC ports on two switches, where one switch is cascaded off the other. |
iSCSI | Internet SCSI | A standard to enable block data transfers over IP networks by wrapping SCSI sequences into TCP/IP packets. |
HBA | Host Bus Adapter | Used with Fibre Channel to take blocks of data and segment them into FC frames for transmission over the FC fabric. |
GBIC | Gigabit Interface Converter | Converts optical signals into electrical signals and is used to attach the SAN’s fibre connections. |
SCSI | Small Computer Systems Interface | An evolving standard with many variations – first developed in 1986 – for transmission of data between hosts and devices (usually disks). |
ATA | Advanced Technology Attachment | A type of commodity disk commonly used in PC systems. |
IDE | Integrated Drive Electronics | Another term for commodity disks used mainly in PC systems. |
SATA | Serial ATA | Serial (rather than parallel) version of ATA technology. |
SAS | Serial Attached SCSI | Serial (rather than parallel) version of SCSI technology. |
PCI | Peripheral Component Interconnect | An internal computer bus to attach devices; exists in several variations with different throughput characteristics. |
TOE | TCP/IP Offload Engine | Used with iSCSI to take blocks of data and segment them for transmission as IP packets over the IP network. |
 | Snapshot | A rapid point-in-time copy of a LUN where only changed blocks are tracked with unchanged blocks still being referred to in the original source LUN. |
 | Clone | A complete point-in-time copy of a LUN. |
LAN | Local Area Network | That part of the network over which data normally travels between client machines and servers or between servers and other devices. |
CBS | Centralised Backup System | Using large centralised tape libraries to combine the backups from several hosts. |
CIFS | Common Internet File System | File system available to client machines over the network from a server machine; based on Microsoft’s earlier Server Message Block (SMB) system and popular in Windows environments. |
RAID | Redundant Array of Independent Disks | Technique to improve the reliability and performance of disks by aggregating several disks and providing the appearance of one large disk to host servers. |
NFS | Network File System | File system available to client machines over the network from a server machine; developed by Sun Microsystems and popular in UNIX environments. |
HSM | Hierarchical Storage Management | A storage system whereby data may reside on different devices, ranging from fast-access media (e.g. disk) to slower media (e.g. tape) with a hierarchy of such storage devices available. |
ILM | Information Lifecycle Management | Recognising that not all data needs to be treated the same throughout its life. |
RPO | Recovery Point Objective | The point in time to which data must be recovered after an incident, reflecting the amount of tolerable data or transaction loss. |
RTO | Recovery Time Objective | The amount of time taken to recover a system after an incident, reflecting how long it is tolerable for a system to be out of service. |
SRM | Storage Resource Management | Software to analyse and monitor patterns of data access and usage. |
SNIA | Storage Networking Industry Association | A not-for-profit trade association with a remit to ensure “that storage networks become complete and trusted solutions across the IT community”. |
SMIS | Storage Management Initiative Specification | An initiative by SNIA “to develop and standardize interoperable storage management technologies and aggressively promote them to the storage, networking and end user communities”. |
 | Bluefin | An older name for what has now become SMIS. |
 | Virtualisation | An over-used term to indicate any apparent provision of an entity that does not physically exist, but is made to appear so by software or other simulation. |
DLT | Digital Linear Tape | A very widely available tape technology that has been around for several years with continually evolving improved versions. |
LTO | Linear Tape-Open | A tape format originally developed jointly by IBM, HP and Seagate and also referred to as Ultrium. |
 | LAN-free | A backup technique where data does not travel over any part of the LAN, but is self-contained totally within the SAN infrastructure. |
 | Server-free | A backup technique where data does not get handled by any servers, but travels directly from disk to tape storage under the control of some kind of data mover. |