I regularly copy and/or ship terabytes of raster data, but only recently revisited hardware I/O:
A 1 gigabit Ethernet card (NIC) costs $10, while a 10GbE NIC is around $200. Idealized math fills a 6TB drive over 1GbE in ~13.3 hours, but real-world throughput is often significantly lower. A quick survey of small-office NASes shows they typically saturate their Ethernet connections. For higher speeds, NAS vendors report the performance of multiple “teamed” 1GbE ports (AKA link aggregation). Want those speeds from your server to your NAS? Details suddenly matter – iSCSI multipath I/O might work for you, but good-old Windows file sharing likely won’t. SMB Multichannel is still in development for Samba, meaning your Linux-based NAS likely doesn’t support spreading SMB traffic across “teamed” ports. And if you’re using your NAS in iSCSI mode, you’ve obviated your NAS’s built-in file server… you’re effectively using it as just another hard drive – aka Direct Attached Storage (DAS).
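For reference, that ~13.3-hour figure is straight line-rate arithmetic. Here’s a minimal sketch (the function name is mine, and it ignores protocol overhead, disk contention, and everything else that makes real copies slower):

```python
# Back-of-the-envelope transfer times (idealized line rate only).
def transfer_hours(size_tb, link_gbps):
    """Hours to move size_tb terabytes over a link_gbps gigabit-per-second link."""
    size_bits = size_tb * 1e12 * 8           # TB -> bits
    seconds = size_bits / (link_gbps * 1e9)  # bits / (bits per second)
    return seconds / 3600

print(f"6TB over 1GbE:  ~{transfer_hours(6, 1):.1f} hours")   # ~13.3 hours
print(f"6TB over 10GbE: ~{transfer_hours(6, 10):.1f} hours")  # ~1.3 hours
```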
Direct Attached Storage – as it turns out – is harder to pin down. The consumer market is occupied mostly by companies like LaCie, which cater largely to video editors on Mac platforms. Commodity HDDs can write at 150MB/s (1.2Gb/s), so USB 3’s 5Gb/s should conceptually come close to saturating disk I/O on a 4-bay DAS. Many DAS units, however, offer Thunderbolt (10Gb/s) connectivity, and finding USB 3 performance details is often challenging. External SATA (6Gb/s) or SAS (12Gb/s) is yet another (rather elegant) option, but requires specialized components outside of most consumers’ wheelhouse.
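To make the 4-bay claim concrete, here’s the same kind of sanity check using the figures quoted above (illustrative arithmetic only, assuming idealized sequential writes):

```python
# Can a 4-bay enclosure of commodity HDDs saturate its host interface?
DRIVE_MBPS = 150   # MB/s per commodity HDD, as cited above
drives = 4
aggregate_gbps = drives * DRIVE_MBPS * 8 / 1000   # MB/s -> Gb/s (~4.8)

interfaces = {"USB 3": 5, "eSATA": 6, "Thunderbolt": 10, "SAS": 12}
for name, gbps in interfaces.items():
    verdict = "interface is the bottleneck" if aggregate_gbps > gbps else "interface has headroom"
    print(f"{name:<11} {gbps:>2} Gb/s vs {aggregate_gbps:.1f} Gb/s of disk -> {verdict}")
```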
There are certainly a lot of details that affect how fast these file-copy workflows will go. If your NAS has USB ports, you may be able to use its UI or Offloaded Data Transfer (ODX) to speed up copying. Under those conditions, NAS might be preferable to DAS. There are other reasons to consider direct attached storage, though. Although it’s a bit of a sidebar, a strong one is accommodating Windows users – likely the most expensive part of an IT organization.
The average “big data” user has specialized, deep knowledge of distributed systems and the cloud; the complexities of “big data” problems have warranted an operational overhaul. “Medium data” users are likely specialists in something – but not data management. These users typically run Windows locally, not Linux in the Amazon Cloud. They don’t want to drop ArcGIS for Desktop to learn MapReduce Geo, because they’re busy learning skydiving or how to be a better parent. $100 a month to Comcast gets them 10Mb/s uploads, and it takes them fifty-six days(!!) to back up that 6TB hard drive into the cloud. Think of the potential productivity losses if the wrong user pursued this path…
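That fifty-six-day number is the same line-rate arithmetic as before, just with a residential uplink (again idealized, ignoring overhead and ISP throttling):

```python
# 6TB pushed through a 10 Mb/s residential uplink.
size_bits = 6e12 * 8                   # 6 TB -> bits
seconds = size_bits / 10e6             # 10 Mb/s uplink
print(f"~{seconds / 86400:.0f} days")  # ~56 days
```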
This is not a criticism of “the cloud.” Far from it – there are lessons to be learned there about “data locality.” Hadoop is a popular big-data buzzword of our time, but most people don’t realize that Hadoop is less about “distributed processing” than it is about “data locality.” Its foundation is the Hadoop Distributed File System (HDFS), and the beauty of HDFS – what we can learn from the cloud – is that data is stored on the same machines that will process it. The I/O limitations discussed above for NAS vs DAS apply all the same, in the cloud or in the enterprise.
As GIS professionals, we cannot turn a blind eye to our hardware and operating-system infrastructure. Moore’s law has advanced processors to the point where computation is now cheaper than the I/O needed to feed it. Within our real-world “medium data” infrastructures, we must be careful to scale realistically and intelligently. Data must be stored close to where it is used; distributed processing is often for naught without complementary distributed storage. In short, the proverbial “IT Guy” acting alone might not be enough to optimize your enterprise GIS – get involved!