Moving Off FAA Mainframes: The Challenges of Transition

 
 
By Chris Preimesberger  |  Posted 2008-10-14
 
 
 

The mainframes that handle our flight plans and air traffic are beyond antiquated. They are flat-out dangerous, say experts and former FAA employees, and the U.S. government agency has been lucky that they have not led to any fatalities to date. Problems have been occurring for many years, ever since the legacy mainframe system was put into operation in January 1988.

That’s right: 1988.

On Aug. 26, 2008, a corrupt file entered the flight plan system and brought it down for about 90 minutes during a high-traffic period late in the day on the East Coast. Contrary to what the FAA's chief administrator originally told the media, this was not an isolated incident. Similar crashes occurred on Aug. 21 and in June, FAA records show, and over this system's 20-year history there have been others.

International intelligence analysis firm Stratfor, in an analysis published on Aug. 27, reported a similar system outage back in 2000. Another was reported in June 2007, in addition to the Aug. 21 and Aug. 26 crashes. Those are the ones we know about; we don't know how many others were never made public.

If the flight-plan system is suspect and is taking a long time to replace, what about the rest of the air traffic system? What is its condition? The flight-plan and air traffic systems work hand in hand to coordinate the nation's air traffic.

The company that built the mainframes for the FAA in the 1980s, North American Philips, went belly up later in 1988, though the FAA was able to buy up all the remaining parts inventory the dying company had available at the time.

The National Airspace Data Interchange Network's (NADIN) current mainframe-based system is an integral part of the overall NAS (National Airspace System) traffic system that processes an average of 1.5 million messages per day. As a result, industry analysts and a number of former Federal Aviation Administration staff members said they believe there is a heightened likelihood of a major air traffic stoppage, as was demonstrated three times this summer by the crash of the system head in Atlanta. They also are concerned about increasing vulnerability to terrorist cyber-attacks.

People connected with this problem inside and outside the FAA agreed that the system needs to be upgraded as soon as possible. The main issues have been coming to an agreement on what kind of system to install for the long term and how to pay for it.

FAA Infrastructure: Air Traffic System History
Most localized air traffic control systems in use today were designed in the 1960s and '70s and installed throughout those years and into the '90s. Radar has been used since World War II.

Many technologies are used in air traffic control systems. Primary and secondary radar are used to enhance a controller's "situational awareness" within his assigned air space; all types of aircraft send back primary echoes of varying sizes to controllers' screens as radar energy is bounced off their skins. Transponder-equipped aircraft reply to secondary radar interrogations by giving an ID (Mode A), an altitude (Mode C) and/or a unique call sign (Mode S). Certain types of weather also may register on a radar screen.

The traffic-handling systems used at most international airports are highly proprietary. Systems engineers are tight-lipped about them in general. They work hand in hand with the flight-plan system and have many redundancies built into them.

Andy Isaksen, a computer scientist for the FAA in Atlanta, was the designer of the flight-plan system. In a 2005 NetworkWorld article, Isaksen told Deni Connor that the NADIN system's two Philips DS714/81 mainframe computers were originally manufactured in 1968 and upgraded with new processors in 1981. Since then, they have become increasingly hard to maintain, support and write code for, Isaksen said.

The Isaksen flight-plan network is the centerpiece of the FAA's air traffic system. Any aircraft that enters or leaves U.S. air space has to file a plan into the system. The network also serves as the sole data interchange between the United States and other nations to distribute flight plans for commercial and general aviation, as well as weather and advisory notices to pilots.

To its credit, the air traffic system probably has been running around the clock 99.9 percent of the time since the tail end of the Reagan administration. But the time has come for it to be replaced, and everybody knows it.
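
For a sense of scale, that 99.9 percent figure can be turned into hours. The sketch below is a back-of-the-envelope calculation, not an FAA measurement: the 99.9 percent availability is only the rough estimate given above, the Jan. 1, 1988, start date is an assumption (the system went live sometime that month), and the function name `implied_downtime_hours` is made up for illustration.

```python
from datetime import date

HOURS_PER_DAY = 24


def implied_downtime_hours(start, end, availability):
    """Hours of downtime implied by an availability fraction over a date range."""
    days_in_service = (end - start).days
    return days_in_service * HOURS_PER_DAY * (1.0 - availability)


# Assumed start date; the article says only "January 1988".
start = date(1988, 1, 1)
# Posting date of this article.
end = date(2008, 10, 14)

hours = implied_downtime_hours(start, end, availability=0.999)
print(f"Implied downtime: about {hours:.0f} hours ({hours / HOURS_PER_DAY:.1f} days)")
```

Under those assumptions, 99.9 percent availability still allows roughly 182 hours, or about seven and a half days, of cumulative downtime over two decades.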

FAA Infrastructure: Flight-Plan System Rot
Stratfor, along with many other industry watchers, is very concerned about the flight-plan system and evidence that the system is wearing out.

"Regardless of what caused the Aug. 26 NADIN crash, [there] is a monumental challenge the event underscores. Here an archaic system that had survived nearly seven years of 9/11-inspired overhauls went down, dumping its entire workload on one other switch. The NADIN system had already been partially upgraded with systems from Lockheed Martin and is slated to be replaced altogether with the FAA's much-hyped NextGen Air Traffic Control system,” said Stratfor.

"But the lack of redundancy and dynamism demonstrated again by the latest NADIN crash makes a cyberattack against critical U.S. infrastructure all the more feasible. And the cost of comprehensively upgrading these systems would be an enormous financial investment, far more than we have seen so far in the years following 9/11."

A Web site run by a number of former FAA staff members, FAAFollies.com, details many of the foibles the agency has suffered in recent years, including these last two system crashes.

In March 2005, a new contract was awarded for a "NextGen" NADIN replacement. So the FAA has been well aware for more than three years that the old system has served its purpose and is ready to be replaced. In fact, the agency had been given warning as far back as 2000 (and perhaps even earlier than that) that the system was beginning to fail.

FAA Infrastructure: Anatomy of a Flight-Plan System Crash
The 90-minute system crash on Aug. 26, which affected pretty much all the major airports in the nation, later was blamed on a single corrupt file (most likely a virus) that had entered the system and somehow torpedoed it into uselessness.

"What happened yesterday at 1:25 p.m. [EDT] was that during a normal daily software load something was corrupted in a file, and that brought [the] system down in Atlanta," said FAA spokesperson Paul Takemoto. "Basically, all the flight plans that were in the system were kicked out. For aircraft already in the air, or [that] had just been pushed back from the gate, they had no problems. But for all other aircraft, it meant delays."

Things got worse when operations were shifted to the backup facility in Salt Lake City, which is designed to handle 125 percent of the overall load, Takemoto said.

"It was far more than that [125 percent], because airlines were refiling their flight plans manually. They just kept hitting the 'Enter' button. So the queues immediately became huge," Takemoto said. "On top of that, it happened right during a peak time as traffic was building. Salt Lake City just couldn't keep up."

The second NADIN system in Salt Lake City, to its credit, continued handling all the West Coast flight plans normally. But when Atlanta crashed, all the East Coast data switched over immediately to Salt Lake City, which could not handle the extra data traffic, even though it was designed to handle 125 percent of normal load in the event of an emergency.
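
The arithmetic behind that overload can be sketched in a few lines of Python. This is an illustrative model, not the FAA's: it assumes the two sites normally carry equal loads, that the 125 percent figure refers to one site's normal load, and that repeated manual refiling multiplies the diverted traffic by a made-up `refile_factor`. The function name `queue_backlog` and all the numbers are hypothetical.

```python
def queue_backlog(per_site_rate, capacity_factor, refile_factor, minutes):
    """Backlog (in messages) at the surviving site after `minutes` of failover.

    per_site_rate   -- normal messages per minute at each site (assumed equal)
    capacity_factor -- surviving site's capacity as a multiple of its normal load
    refile_factor   -- multiplier on the failed site's traffic (1.0 = no refiling)
    """
    capacity = per_site_rate * capacity_factor
    # The surviving site keeps its own traffic and inherits the failed site's.
    arrivals = per_site_rate + per_site_rate * refile_factor
    backlog = 0.0
    for _ in range(minutes):
        # Anything that cannot be processed this minute stays in the queue.
        backlog = max(0.0, backlog + arrivals - capacity)
    return backlog


# Hypothetical: 100 messages per minute per site, a 90-minute outage.
print(queue_backlog(100, 1.25, 1.0, 90))   # full diverted load, no refiling
print(queue_backlog(100, 1.25, 2.0, 90))   # every diverted plan filed twice
print(queue_backlog(100, 1.25, 0.25, 90))  # only a quarter of the load diverted
```

In this toy model, a site sized for 125 percent of its own load keeps up only if the diverted traffic is no more than a quarter of what it already handles. A full takeover leaves it 75 messages further behind every minute, and duplicate filings make the queue grow faster still.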

Commercial aircraft of any type cannot take off without having filed a valid flight plan, one that includes destination, estimated flight speed, description of cargo, estimated altitude, weather conditions and a number of other data points.

So, for a part of the afternoon of Aug. 26, pilots at about 40 U.S. airports were forced to manually type their flight plan information into the system, causing long delays in takeoffs. Chicago's O'Hare International, one of the two or three busiest airports in the world, and nearby Midway Airport were among the most directly affected.

"We've just never seen it fail in this manner," Hank Krakowski, the chief operating officer for the FAA's Air Traffic Organization, said in his media remarks.

However, a look at the record shows it had indeed failed several times before, including only five days prior to the Aug. 26 crash. This excerpt comes from the FAA's own Web site (PDF format), dated Aug. 22:

"The aforementioned NADIN outage last evening [Aug. 21] caused more than 100 delays after flight plans were rejected. The outage is currently being blamed for 134 departure delays but this figure could climb. The legacy NADIN in Atlanta crashed. Salt Lake City took over but had problems with the high queue level …"

FAA Infrastructure: Progress with a New Flight-Plan System
On Sept. 23, the FAA exercised the second option year on its 2006 SAVES (Strategic Sourcing for the Acquisition of Various Equipment and Supplies) contract, which was approved by Congress, to fund the systems upgrade. The IDIQ (indefinite delivery, indefinite quantity) contract will total $63 million after all options are exercised. 

To date, the FAA has spent about $23 million of that amount; GTSI is budgeted to spend about $13 million more this year.

The contract was awarded under the Federal Strategic Sourcing Initiative and is based on Office of Management and Budget mandates calling for agencies to consolidate their technology infrastructures. So far, the SAVES program has helped the FAA standardize on technology, source goods and services more efficiently, and effectively monitor IT spending, an FAA spokesperson said.

Sun Microsystems' open-source OpenSolaris/ZFS/SunFire server/Thumper storage infrastructure, which features built-in, state-of-the-art virtualization capability, was a key building block on which the FAA IT evaluation group settled. Some of the new software is already being used in the air traffic system; ZFS (Sun's open-source Zettabyte File System) is being used in the FAA's air traffic data center.

"The FAA uses a large quantity of Sun Solaris servers in a variety of configurations to support some of our noncritical business applications," Andy Isaksen, manager of the Communications Infrastructure Engineering Team for NADIN and architect of the original mainframe system, said. "ZFS is being used on at least one service within the Air Traffic Organization Enterprise Data Center."

Isaksen said, "NADIN, which is responsible for flight plan distribution ... is nearing completion of our user migration waterfall. We began our migration to the new NADIN from our legacy system in March 2008 and the transition is scheduled to complete in early 2009. We are approximately 75 percent complete."

Whatever infrastructure NADIN uses, it is responsible for all flight plan distribution for hundreds of airports, and it provides the gateway between the aviation community and the FAA, Isaksen said.

The FAA augmented its old Philips DS714 mainframes in 2005 at the FAA data centers in Atlanta and Salt Lake City with Stratus ftServer 6400s, which run on Intel Xeon processors. However, the NADIN system, which is compliant with National Institute of Standards and Technology Special Publication 800-53 security controls and operates on a private network, will keep evolving to the Sun-Cisco implementation.

The custom-built NADIN application is not hardware- and operating system-dependent and can be compiled to run on many server platforms, Isaksen said. This includes Solaris, so the changeover was not a major issue.

FAA Infrastructure: System Integrator's View
"What the FAA is doing is common to what a lot of other [government] agencies are doing: They're trying to do more standardization across their IT infrastructures," said Tom Kennedy, vice president of sales at GTSI, the systems integrator selected by the FAA.

One of the main requirements in the GTSI contract is that the FAA wanted more control of equipment purchases. GTSI worked with the FAA's standardization committee, chaired by longtime FAA IT administrator Rick Jordan, to come up with the standards for upgrading the networking and server/storage parts of the systems, facilitate the buys and carry out the implementations, Kennedy said.

"Right now, we're mostly working on the non-NAS side of the FAA's IT," Kennedy said. "What we want to do is show success on this side of the system, and then bring it to the NAS side."

One of the main challenges GTSI faces is consolidation of storage devices, Kennedy said.

"They made a major investment in virtualization," Kennedy said. "In their [previous] environment, they had disparate storage devices from multiple vendors, all across the FAA. They're now upgrading or consolidating them, via the standards. Now they've honed in down to one platform."

And that would be Sun-Cisco.

"SAVES was a pretty high initiative that came out of the [U.S.] CIO's office. The first year, it seemed like there was some resistance [to the upgrade] internally, as people were getting comfortable with actually having [new] standards," Kennedy said. "But since our implementation, we've done two times the volume [of data transactions] this year versus last year."

That kind of performance will catch the attention of bureaucrats every time.

FAA Infrastructure: Security Upfront
In addition to the new Sun infrastructure, the FAA also has taken measures to tighten security from all access points.

ForeScout Technologies, a network access control and policy management provider for large enterprises, was selected to supply a number of its CounterACT network appliances under the FAA's SAVES contract with GTSI.

CounterACT was approved as an agency standard by the FAA's Technology Control Board. FAA networks throughout the United States are now using CounterACT to improve network access.

ForeScout President Gord Boyce said CounterACT combines clientless network access control and malicious threat detection to ensure that connected (and, importantly, connecting) devices are in compliance with network security policies and are free of self-propagating threats.

CounterACT seamlessly integrates into any network environment without requiring costly upgrades or infrastructure changes, Boyce said. It also enables enterprises to tailor enforcement actions to match the level of policy violations, ensuring that user disruption occurs only when it is warranted or required by the IT staff, he added.

"The FAA did a nine-month deep dive to make sure our product met their requirements," Boyce said. "The meat of their business-side deployment is just now beginning. They expect to roll us out to the rest of their network over the next nine months."

Not only will CounterACT give the FAA the security to lock down their network, Boyce said, but it also will allow "understanding as to what's on their network, and the knowledge to know what their network looks like."

CounterACT can see any device that attempts to obtain an IP address, Boyce said. "One of our biggest differentiators in the market is the fact that we are clientless. We don't need to have any prior knowledge of a device as it connects to your network," he said.

"Whether that's an IP phone, an IP printer, a contractor that you've never seen before, a managed desktop or laptop—anything that wants to get an IP address, we're going to be able to identify and interrogate it, and do some sort of a policy enforcement on it."