Back To Basics: Testing Industrial Control Systems 101

Industry has thrown down the gauntlet to vendors and suppliers of security testing solutions: testing the network stack or communications protocols alone is insufficient to ensure safe and reliable operation of industrial components. Plenty of testing tools exist today, but vendors and asset owners are telling us that it is not enough to report whether the network stack is up or down. They need to know what jeopardizes uptime and, most importantly, safety. A testing methodology that does not reveal the Failure on Demand calculations, provide a reasonable model of failure modes and their predictability, and give accurate feedback as to exactly what to protect against does not meet the needs of industry.

It has been a while since I’ve looked at Safety Integrity Levels (SIL) in depth, but some recent research has reminded me of very important lessons of the past. In 1996, SP84 published ANSI/ISA-S84.01-1996; IEC 61508 for functional safety and IEC 61511 for process safety followed, and the game changed forever. It is no longer considered acceptable to release a safety component into the marketplace when its Failure on Demand rate, Mean Time Between Failures, and Mean Time to Repair are not known. Industry developed, and regulatory bodies now enforce, requirements that vendors test devices and know these figures with certainty, based on either:

  • Actual rigorous testing criteria
  • Empirical data to demonstrate reliability
Devices are not allowed into SIL-rated environments unless they can demonstrate compliance with the IEC safety standards, period. Being unsafe from a device perspective is no longer tolerated. The post-9/11 economy has likewise made it unacceptable to simply accept insecurity, and the lessons of safety will serve those who listen. The challenge is that many think security cannot be tested the same way, a view we would counter. Formal test methodologies that follow sound engineering and design practices, combined with a keen understanding of network communications and device logic, provide all the tools necessary for a similar model.
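
To make the reliability figures mentioned above a little more concrete, here is a minimal sketch of how they are commonly related. It assumes the simplified single-channel (1oo1) low-demand approximation, PFDavg ≈ λ_DU × TI / 2, and the low-demand SIL bands from IEC 61508. The function names, the failure rate, and the proof-test interval are hypothetical values chosen for illustration, not data from any real device.

```python
# Illustrative only: simplified 1oo1 low-demand approximation that ignores
# common-cause failures, diagnostics coverage, and repair time.

# Average probability of failure on demand (PFDavg) bands for low-demand mode.
# Each entry is (SIL, lower bound inclusive, upper bound exclusive).
SIL_BANDS = [
    (4, 1e-5, 1e-4),
    (3, 1e-4, 1e-3),
    (2, 1e-3, 1e-2),
    (1, 1e-2, 1e-1),
]


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def pfd_avg(lambda_du_per_hour, proof_test_interval_hours):
    """Simplified PFDavg: dangerous undetected failure rate times half the
    proof-test interval."""
    return lambda_du_per_hour * proof_test_interval_hours / 2.0


def sil_for_pfd(pfd):
    """Return the SIL band a PFDavg falls into, or None if it is outside
    every band."""
    for sil, low, high in SIL_BANDS:
        if low <= pfd < high:
            return sil
    return None


# Hypothetical device: 2e-6 dangerous undetected failures per hour,
# proof-tested once a year (8760 hours).
example_pfd = pfd_avg(2e-6, 8760)
example_sil = sil_for_pfd(example_pfd)
example_availability = availability(50_000, 8)
```

With these assumed inputs the PFDavg works out to roughly 8.76e-3, which sits in the SIL 2 band. A real assessment would use the full models in the standards rather than this shortcut.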

There are plenty of “security” testing tools out there, to be sure. The industrial process, however, is not a lab: in most situations there is a high potential for systematic failures that result in downtime and lost revenue and, most importantly, the potential for safety problems. The increasing prevalence of industrial Ethernet is driving a paradigm shift in safety, and recent failures are waking industry up to the challenge.

Testing voltage, resistance, waveforms, and device behavior is fairly well understood in safety. Hooking up an oscilloscope or checking a device under test with a volt-ohm meter is little challenge today. But what about network communications? Faulting the logic inside the device can cause the device to fail or behave erratically, but it can also result in conditions that actually affect I/O.

A few years ago I worked for a company whose customers were complaining about device performance, and we found that many of them were using retail network cards and interfaces in industrial environments to save a few dollars. We had a hard time convincing people that this was a bad idea, so we commissioned a study. The study placed industrialized or “hardened” network cards and cheaper retail cards (which are mostly suitable for home use) in an electronic noise chamber that tested components at up to 2000 volts, then measured the transfer rates and response times of the network traffic. Transfer times on the cheaper cards increased geometrically starting at 700 volts, and the cards died before reaching 1200 volts. The industrial cards stayed nearly flat all the way out to 2000 volts. The results were clear: the retail cards could not sustain network traffic in such conditions. Our customers took the results to heart and deployed better cards; their network issues simply went away, and they measured the benefit in terms of increased efficiency.

This is a solid example of applying testing in the physical realm. We are controlling physical processes, and as such the physical effects must be evaluated. Note, of course, that Achilles does not do electronic noise testing; what it is uniquely positioned to do is show how network traffic can generate physical failures in a component. Simply enumerating errant behavior does little to demonstrate whether a device will cause a systematic failure or not. Just as in safety, one must understand the physical and electrical behavior of the failure.

At the end of the day, your testing methodology should always:
  1. Find potential faults and vulnerabilities
  2. Demonstrate conclusively (and not just theoretically) whether or not those vulnerabilities can cause the I/O to fail or the device’s physical function to be modified
  3. Simulate an actual environment: Devices rarely stand alone; they work in concert with other devices. Should they not be tested in their deployed state as well? Testing only the device is simply a QA test. Testing the deployed architecture shows what will happen in production
  4. Isolate Device Under Test functionality to be sure that the true source of the fault can be viewed
  5. Record everything in near real time so that the information is accurate and the true faults are exposed.
Most security testing tools available today will fulfill #1 and partially address the others, but modeling what is happening on the I/O and recording the device in its deployed architecture is not merely a benefit of our testing methodology; it is absolutely essential to meeting the current needs of safety and the emerging needs of security.
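
As a rough illustration of the five points above, the sketch below sends each test case to a device, checks both the network response and the I/O, resets between cases so a fault can be traced to its source, and timestamps every observation. It is not how Achilles or any particular product works. `run_campaign`, `classify`, `Record`, and the simulated `FakeDevice` (including its payloads and its 12.0 output value) are all hypothetical names and values invented for this example.

```python
import time
from dataclasses import dataclass


@dataclass
class Record:
    timestamp: float    # when the observation was taken
    test_name: str
    network_ok: bool    # did the device still answer on the network?
    io_value: float     # what the physical output read
    io_deviated: bool   # did the output leave its expected band?


def classify(record):
    """Separate faults that reach the I/O from faults that stay in the stack."""
    if record.io_deviated:
        return "io_affected"
    if not record.network_ok:
        return "network_only"
    return "no_effect"


def run_campaign(test_cases, send, read_io, reset, expected_io, tolerance):
    """Run each (name, payload) case and log network and I/O behavior."""
    log = []
    for name, payload in test_cases:
        reset()                       # isolate each case from the previous one
        network_ok = send(payload)    # exercise the potential fault
        io_value = read_io()          # observe the physical output
        deviated = abs(io_value - expected_io) > tolerance
        log.append(Record(time.time(), name, network_ok, io_value, deviated))
    return log


class FakeDevice:
    """Simulated device holding an analog output at 12.0 (hypothetical)."""

    def __init__(self):
        self.output = 12.0

    def reset(self):
        self.output = 12.0

    def send(self, payload):
        if payload == b"flood":
            return False              # stack stops answering, output holds
        if payload == b"bad-length":
            self.output = 0.0         # fault propagates to the I/O
        return True

    def read_io(self):
        return self.output


device = FakeDevice()
cases = [("baseline", b"ping"), ("flood", b"flood"), ("bad-length", b"bad-length")]
log = run_campaign(cases, device.send, device.read_io, device.reset,
                   expected_io=12.0, tolerance=0.5)
results = [(r.test_name, classify(r)) for r in log]
```

In this simulation the flood case is classified as a network-only fault while the bad-length case is classified as affecting the I/O, which is the distinction the checklist asks a methodology to draw. A real harness would replace `FakeDevice` with actual traffic generation and I/O monitoring inside the deployed architecture.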