Introduction
This document describes the solution to the problem that occurs when ASR9K V1 DC power modules vanish from admin show platform. Lineage Version 1 (V1) Direct Current (DC) power supplies might not appear in inventory after they lose both power feeds.
Problem
When you test the DC power feeds and remove power from the DC power supplies, you check show platform and do not see the power supplies listed.
Here are the steps that you take at the time of the test:
1. DC power is connected and turned on for the top power tray/modules and the bottom power tray/modules.
2. Simulate full power failure and turn off all DC inputs to the power trays/modules.
3. Restore DC input to top tray/modules only.
4. Wait for the device to boot up (power to the bottom tray/modules is still off).
5. Restore DC input to the bottom tray/modules.
Upon restoration of the DC power inputs to the bottom tray, you expect to see the power modules in admin show inventory power-supply and admin show platform. However, this is not the case.
Explanation:
Lineage power supplies generate Inter-Integrated Circuit (I2C) errors if no DC power input is connected. Their presence can still be detected, because this is done via a separate connection and not I2C. On power up, however, the system cannot communicate with them over I2C to discover their state. As a result, a power module that is physically present in the bottom tray is not seen in admin show platform.
The power manager code marks them as failed due to the high error counts that the supplies generate. The recovery method is an Online Insertion and Removal (OIR) of the supply.
There is a good explanation in the description section of CSCun46616: Power module I2C failure handling (V1 mostly).
It is replicated here:
An unpowered V1 module needs two voltages to detect its own address. These voltages are the 5V and 8V. The 5V is shared between both trays of the 9010, but the 8V is not. This means that if a module is plugged into an unpowered slot of a tray that doesn't have a powered module already, this module does not detect the correct address.
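The voltage dependency described above can be sketched as a small model. This is only an illustration under simplifying assumptions, not the actual driver logic: the names Tray, has_5v, has_8v and can_detect_address are hypothetical, and the model assumes that a module with its own DC input always detects its address.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tray:
    # One entry per module slot: True if that module has a DC input.
    powered: List[bool]


def has_8v(tray: Tray) -> bool:
    # The 8V rail is local to a tray: it exists only if at least one
    # module in that same tray is powered.
    return any(tray.powered)


def has_5v(trays: List[Tray]) -> bool:
    # The 5V rail is shared between both trays: any powered module in
    # the chassis provides it.
    return any(any(t.powered) for t in trays)


def can_detect_address(trays: List[Tray], tray_index: int, slot: int) -> bool:
    tray = trays[tray_index]
    if tray.powered[slot]:
        # Assumption of this sketch: a powered module detects its address.
        return True
    # An unpowered module needs both the 5V and the 8V rail.
    return has_5v(trays) and has_8v(tray)


# Top tray powered, bottom tray without any DC input (steps 3 and 4).
top = Tray(powered=[True, True, True])
bottom = Tray(powered=[False, False, False])
bottom_ok = can_detect_address([top, bottom], tray_index=1, slot=0)
```

In this model, bottom_ok is False because the bottom tray has the shared 5V but no 8V source of its own.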
To handle this issue, the power modules driver code needs to change so that it does not continuously attempt the I2C access for failed modules. This might happen due to repeated (stuck) I2C alerts for modules that can't be accessed through I2C. These repeated I2C attempts cause a delay in the power modules driver initialization, which may prevent line cards (LCs) from being allowed to boot by the shelf manager if it does not receive the chassis power allocation in time from the power modules driver.
This is exactly what happens in this case. None of the power supplies on the bottom shelf have any DC inputs, so there is no 8V source for the tray and thus all of the modules in the bottom tray begin to generate I2C errors. The power manager marks all those modules as failed and does not attempt to recover them until they are replaced (i.e. OIR).
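The failure handling described above can be illustrated with a minimal sketch. It is not the power manager code: PowerModule, poll, oir and the ERROR_THRESHOLD value are hypothetical names and numbers chosen for the example. The sketch shows why a module stays marked as failed after DC input is restored, and why an OIR clears the state.

```python
from dataclasses import dataclass

# Hypothetical value; the real error threshold is not stated in this document.
ERROR_THRESHOLD = 3


@dataclass
class PowerModule:
    name: str
    has_dc_input: bool
    error_count: int = 0
    failed: bool = False

    def i2c_read(self) -> str:
        # Model: a V1 module without DC input answers with an I2C error.
        if not self.has_dc_input:
            raise IOError(f"I2C error on {self.name}")
        return "ok"


def poll(modules):
    """Run one polling pass and return the names accessed over I2C."""
    accessed = []
    for m in modules:
        if m.failed:
            # Failed modules are skipped, so they are never rediscovered.
            continue
        accessed.append(m.name)
        try:
            m.i2c_read()
            m.error_count = 0
        except IOError:
            m.error_count += 1
            if m.error_count >= ERROR_THRESHOLD:
                m.failed = True
    return accessed


def oir(module: PowerModule, has_dc_input: bool) -> None:
    """Model a removal and reinsertion, which clears the failed state."""
    module.has_dc_input = has_dc_input
    module.error_count = 0
    module.failed = False


bottom = PowerModule(name="bottom-M0", has_dc_input=False)
for _ in range(ERROR_THRESHOLD):
    poll([bottom])              # errors accumulate until the module is failed
bottom.has_dc_input = True      # step 5: DC input restored, no OIR yet
after_restore = poll([bottom])  # module is skipped and stays failed
```

In this sketch, restoring the DC input alone does not bring the module back; only a call to oir resets the failed flag so that the next polling pass accesses the module again.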
Solution
The system can be recovered when you restart two processes in this order:
process restart pwrmon
process restart shelfmgr