App note: Using a hardware or software CRC with enhanced core PIC16F1XXX in class B applications

DP January 3, 2015 15 Comments

apps

An app note from Microchip: Using a hardware or software CRC with enhanced core PIC16F1XXX in class B applications (PDF!)

Class B safety routines are increasingly used in microcontrollers to detect faults in safety-critical applications. The primary method for detecting faults in microcontroller program memory is by using a Cyclic Redundancy Check (CRC) as defined by the IEC 60730 standard.
A CRC can be used to prevent application faults due to corrupted program memory by performing a periodic check to determine if the check value has changed.
This application note will describe how to implement the Software CRC available as part of the Class B Safety Software Library and the hardware CRC used in selected microcontrollers (this document will focus on the PIC16F161X family).
Both methods discussed in this application note satisfy IEC 60730 spec H.2.19.3.2 to test Invariable Memory for all single-bit faults with 99.6% coverage.
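
As a rough illustration of the periodic check described above, here is a minimal C sketch of a software CRC over a memory image. It is not taken from the Class B Safety Software Library: the function names and the choice of CRC-16/CCITT-FALSE parameters (polynomial 0x1021, initial value 0xFFFF) are assumptions made for this example, and the library or the PIC16F161X hardware module may well be configured differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed parameters (CRC-16/CCITT-FALSE); not confirmed by the app note. */
#define CRC16_POLY 0x1021u
#define CRC16_INIT 0xFFFFu

/* Bitwise CRC-16, MSB first, no reflection, no final XOR.
 * Pass CRC16_INIT as `crc` for the first chunk; pass the previous
 * result to continue over a following chunk. */
uint16_t crc16_update(uint16_t crc, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u)
                crc = (uint16_t)((crc << 1) ^ CRC16_POLY);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Hypothetical helper: returns 1 if the image still matches the stored
 * reference CRC, 0 if a fault was detected. */
int memory_is_intact(const uint8_t *image, size_t len, uint16_t expected)
{
    return crc16_update(CRC16_INIT, image, len) == expected;
}
```

In a real application the reference value would typically be computed at build time and stored next to the image, and a periodic task would call something like `memory_is_intact()` and move the hardware to a safe state on a mismatch.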

Join the Conversation

15 Comments

rumburack says:

January 3, 2015 at 9:48 pm

What on earth and in space could cause bit errors? And if so, what help would it be to detect 99,6% of these errors and miss a whopping 0,4% of them?

I mean, in practice, does this really help or isn’t it snake oil?

1. michal says:
  
  January 3, 2015 at 10:58 pm
  
  There is this thing called radiation and high energy particles that can flip your ram’s bits.
  
  Even more likely to happen on pico sats that fly to orbit, but can potentially happen on earth as well.
  
  1. rumburack says:
    
    January 4, 2015 at 9:58 am
    
    That’s what I was cogitating about….
    
    How often do bits flip? And how does it help me if I still miss 0,4% of these flips? And what to do if I detect a flip?
2. KH says:
  
  January 4, 2015 at 11:42 am
  
  Reset won’t help, for starters. You can use a second copy of the program if the first one is bad. Then alert the user. Or command the hardware into a safe position, then stop.
  Note that this item is only for invariant memory (program/data flash). There are plenty more work items in IEC 60730 that you’d need to implement for resilient appliances.
  
  99.6% is merely a standard spec. A CRC can scan the entire program space, of course, or memory can be split into smaller blocks to optimize detection of bit flips. Even though the CRC on new PICs can scan largely automatically, it is still dependent on a small part of the program being able to function and take appropriate action. If you test that part of the program too, then the tester must still function, ad infinitum. I think 100% is hard to do on commodity MCUs. Something like a Boeing 777, I believe, has 3 separate and different processors running different programs and doing voting.
  
  Bit(s) may either flip after readout (upset event on register or logic) or flash cell(s) may go bad. Of course MCU flash cells are much more resilient than SSD flash cells, but it’s still storing trapped electrons. As for how often, it depends on the radiation/particle environment and various aspects of the IC.
  
3. Torkell says:
  
  January 4, 2015 at 6:56 pm
  
  It can happen – one product I used to work with would occasionally encounter invalid instructions. After a lot of debugging (including adding a module between the CPU and the RAM to detect if the CPU was writing to where it shouldn’t be), the cause was eventually discovered to be a flawed hardware design where the system bus was marginal in some way (I forget how – perhaps poor termination or capacitive coupling from nearby tracks) and every so often, a bit would get flipped on its way from the RAM to the CPU. So a background CRC check was added to detect and warn when corruption was detected, hopefully before the CPU misread a buffer and jumped off into never-never-land.
  
  1. rumburack says:
    
    January 4, 2015 at 10:51 pm
    
    That’s what I meant with snake oil.
    
    Instead of solving the problem another layer of not really working problem detection (without solution) is added.
  2. Torkell says:
    
    January 5, 2015 at 12:11 am
    
    I agree the ideal solution would have been a redesigned board, but that’s also the expensive solution. In practice flipped bits were rare, and the background check did catch some of them. We also added an option to automatically reload the firmware from flash and continue without a restart – possible as this system had no relocatable or dynamically-generated code.
    
    Of course, this assumes the flipped bit isn’t in your CRC-checking code…
  3. KH says:
    
    January 5, 2015 at 4:07 am
    
    No, IMHO it’s not snake oil. Even if everything else is working, someone using say a 10 year old appliance that is regularly left outdoors or in a hot car may well benefit from better flash resilience.
    It’s important to know that there are 1001 methods in this area and they target different aspects of the (very wide or diverse) problem. Loosely speaking, to be robust, there is no one magical method that can solve all our problems, better have at least some defence in depth.
Max says:

January 5, 2015 at 9:40 am

Yes and no. Much like a firewall, it does help when it’s applied. But does a firewall make you “safe”? Hell no, you’re just somewhat safer. So, while it’s not pointless, it is snake oil in the sense that it absolutely does not make your application suddenly and magically “safe”, no matter how much checking you do. All that every possible check in the world does is buy you a tiny bit more safety, but none of it can make you completely safe, and it certainly is a game of diminishing returns. That said, in relatively critical applications it’s better than not having it, as long as you’re aware of what it actually gets you…

1. KH says:
  
  January 5, 2015 at 10:14 am
  
  I agree, except when someone uses the term “snake oil” it is usually with pejorative connotations. There are no simple answers. Super safe and reliable thingies are hard. IEC 60730 is going in the right direction. We should also keep in mind that avionics-level reliability is not cheap. Let us remember, it’s hard to claim that any method is perfect. Say one uses 3-way majority voting in digital logic: if the Flying Spaghetti Monster wants to mess with you, He can still flip 2 or 3 paths and give you a wrong result… ;-)
  
  1. rumburack says:
    
    January 5, 2015 at 10:28 am
    
    I used the term “snake oil” for a promise of a solution that does not deliver. And prevents the user from implementing a real solution.
    
    I had the bad taste of “hey, look, pointy haired boss, we have another bullet on our feature list” instead of manufacturing a hardened version of the IC or writing some lines about shielding the IC.
    
    And to educate me due to the resulting discussion.
    
    (“personal firewalls” are some of the best examples of snake oil…)
  2. KH says:
    
    January 5, 2015 at 11:34 am
    
    Oh okay. I would say this thing is more in the class of EU Nanny State Über Bureaucracy’s Directive That Citizens Should Follow For Their Own Good. Of course usually they mean well, but Brussels is just too powerful.
    
    I disagree with the “snake oil” term. It didn’t make sense to me for this particular thing. Or, painting things in this safety/reliability field as “snake oil” is like a “lashing out” reaction.
    
    I can’t figure out why testing invariant memory can somehow detract from other ‘real’ solutions. What else can you do with program memory? Can you propose to me another fundamentally different way of testing program flash? Or if you don’t check it this way then how do you know it’s the correct program that is running? Surely the automotive and the medical device industry are checking their programs… how else are they checking their programs?
rumburack says:

January 5, 2015 at 11:51 am

To be honest, I have no clue about how to avoid bit flipping in RAM. That’s why I want to see the discussion, I find the topic quite interesting…. I hoped that someone had first hand knowledge and could feed me with some web pointers or numbers about how often and where and when and such. :-)

I could think of some ideas for solving the problem, but I am guessing out of cluelessness… Shield the device. Put some layers of copper/whatever above the RAM cells inside the IC? Like that reverse-engineering protection stuff? Add external shielding? What to use? Create a chip with ECC cells, quite a different design… Why use CRC when error-correcting software codes exist? (Use some memory for a correction map and have a way more useful solution.) Run two microcontrollers in parallel and sync/compare them somehow? If I create something that flies to the moon or carries PAX around, how much cost is acceptable?

All this seems more reliable to me than a CRC that still sits on 0,4% misses.
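
To make the error-correcting-code idea above concrete, here is a minimal C sketch of a Hamming(7,4) code, which stores 4 data bits in a 7-bit word and can repair any single flipped bit in that word. This is only an illustration of the technique: it is not part of the Microchip app note or of IEC 60730, and the function names are made up for the example.

```c
#include <stdint.h>

/* XOR of the 1-based positions of all set bits in a 7-bit word.
 * For a valid codeword this is 0; after a single bit flip it equals
 * the position of the flipped bit. */
static uint8_t hamming_syndrome(uint8_t word)
{
    uint8_t s = 0;
    for (uint8_t pos = 1; pos <= 7; pos++) {
        if (word & (1u << (pos - 1)))
            s ^= pos;
    }
    return s;
}

/* Encode the low 4 bits of `nibble` into a 7-bit codeword.
 * Data bits go to positions 3, 5, 6, 7; parity bits to 1, 2, 4. */
uint8_t hamming74_encode(uint8_t nibble)
{
    uint8_t word = 0;
    if (nibble & 1u) word |= (uint8_t)(1u << 2);  /* position 3 */
    if (nibble & 2u) word |= (uint8_t)(1u << 4);  /* position 5 */
    if (nibble & 4u) word |= (uint8_t)(1u << 5);  /* position 6 */
    if (nibble & 8u) word |= (uint8_t)(1u << 6);  /* position 7 */

    uint8_t s = hamming_syndrome(word);
    if (s & 1u) word |= (uint8_t)(1u << 0);       /* parity at position 1 */
    if (s & 2u) word |= (uint8_t)(1u << 1);       /* parity at position 2 */
    if (s & 4u) word |= (uint8_t)(1u << 3);       /* parity at position 4 */
    return word;
}

/* Decode a 7-bit codeword, correcting at most one flipped bit. */
uint8_t hamming74_decode(uint8_t word)
{
    word &= 0x7Fu;
    uint8_t s = hamming_syndrome(word);
    if (s != 0)
        word ^= (uint8_t)(1u << (s - 1));         /* flip the bad bit back */

    return (uint8_t)(((word >> 2) & 1u)
                   | (((word >> 4) & 1u) << 1)
                   | (((word >> 5) & 1u) << 2)
                   | (((word >> 6) & 1u) << 3));
}
```

The trade-off is visible in the layout: three extra bits for every four data bits, and two flips in the same word get "corrected" into the wrong value, so a code like this complements a detection check rather than replacing it.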

1. KH says:
  
  January 5, 2015 at 1:20 pm
  
  Shields? Sure, it can help — to some level. Note that most methods mitigate upset events. Cosmic rays collide with the atmosphere and higher energy ones shower us with all sorts of particles — muons can penetrate a mile of rock. An extreme X-class flare from the Sun may still disable many well-designed satellites. There is no “super solution”. If there were, US space infrastructure would be using it long ago. As it is, US DoD have spent lots to study this, I doubt small ants like us can do better…
  
  Anyway you should be able to find many papers via Google searches, start from SEU and branch out from that. Also learn about faults, errors and failures. They characterize rad-hard chips using particle beams, there are facilities for doing such things in academic or corporate research.
  
  This Microchip doc for the PIC16 family is mainly aimed at white goods compliance to EU rules, that’s all. White goods (appliances) are extremely cost-sensitive, we will comply with EU, we will not move the heaven and the earth to make perfect thingies. But when cost is no object, of course things will be much more flexible. Space avionics will use voting circuits, are fabbed as rad-hard, and have stuff like redundancy and plenty of methods to recover from anomalies. Hardly ideal, yet many smart and dedicated people have worked on this problem and have no “magic wand” solutions to offer…
  
Sleepwalker3 says:

January 5, 2015 at 1:23 pm

This is getting into quite a discussion and the best place for that is the forum, so I’ve started a forum thread for it. Can everybody please put any further comments over on the forum at this address:
http://dangerousprototypes.com/forum/viewtopic.php?f=2&t=6922

