Linux PCI, Hot-Swap PCI and CardBus Adapter Support

This document describes support code for PCI and CardBus adapters on the Linux Operating System.

Goals

This code is designed as support code, rather than an abstraction layer. Instead of implementing every possible way to scan and activate every piece of strange hardware, it is designed to replace the boilerplate scan code that is largely duplicated in older PCI drivers. Having a guideline for implementing bus scans that considers hot-swap PCI and CardBus devices makes it more likely that device drivers can be easily updated when converting to these interface types.

Note: Hot-swap PCI and CardBus devices, where devices may be suspended, removed and replaced, or additional devices added to an already installed driver, have required significant updates to drivers written using older driver guidelines.

Design

Existing practice

Typical PCI drivers in Linux implement an ad hoc scan of the PCI bus hierarchy, searching for devices they support. As each device is found, the resources it uses, typically an address range and IRQ, are read from PCI configuration space. I/O space regions are checked for conflicts (the traditional Linux mechanism for ensuring that a driver has exclusive access to a device), the address ranges are registered as controlled by the driver, and memory regions are mapped into the kernel's address space.

Careful drivers make additional checks to verify that the device has usable settings for IRQ and memory regions, I/O and memory regions are enabled in the PCI command register, bus master capability is enabled, and the PCI latency register has a reasonable setting.

The pci-scan support code replaces the largely duplicated PCI scan code with a static table of struct pci_id_info, which is passed to the function pci_drv_register(). The following example shows how a hundred lines of PCI scanning code in the epic100.c driver are replaced by a table and a function call. Supporting a newly released chip only requires adding a table entry, while the ad hoc scan had code that explicitly checked for acceptable chips.

    static void *epic_probe1(struct pci_dev *pdev, void *init_dev,
                             long ioaddr, int irq, int chip_id, int card_idx);
    static int epic_pwr_event(void *dev_instance, int event);

    #define EPIC_IOTYPE PCI_USES_MASTER|PCI_USES_MEM|PCI_ADDR1

    static struct pci_id_info pci_tbl[] = {
        {"SMSC EPIC/C 83c172", {0x000510B8, 0x7fffffff, 0, 0, 7, 0xff},
         EPIC_IOTYPE, EPIC_TOTAL_SIZE, HAS_ACPI},
        {"SMSC EPIC/100 83c170", {0x000510B8, 0x7fffffff,},
         EPIC_IOTYPE, EPIC_TOTAL_SIZE, TYPE2_INTR},
        {"SMSC EPIC/C 83c175", {0x000610B8, 0x7fffffff,},
         EPIC_IOTYPE, EPIC_TOTAL_SIZE, HAS_ACPI | MII_PWRDWN},
        {0,},    /* 0 terminated table. */
    };

    struct drv_id_info epic_drv_id = {
        "epic100", PCI_HOTSWAP, PCI_CLASS_NETWORK_ETHERNET<<8, pci_tbl,
        epic_probe1, epic_pwr_event,
    };

This structure is then used in a call to pci_drv_register():

    int init_module(void)
    {
        return pci_drv_register(&epic_drv_id, NULL);
    }

Note: The full driver follows the recommendation of always emitting source code version information, along with a second message if no devices are found.

Restating our original goal: This code isn't designed as an isolation or abstraction layer around PCI configuration space. An abstraction layer tries to make the underlying implementation opaque, while this interface has the goal of making typical drivers simpler and smaller by extracting commonly used code. The model is that every device will need an activated base address and an IRQ, and most drivers won't need to interact with PCI configuration space if these are provided. For additional resources, such as addresses from several PCI registers, the driver is still free to directly access PCI configuration space.

The `pci_id_info` table.

The fields in the pci_id_info table entries may be logically grouped into two sets. The first set is identifying and naming the device. The second set is attaching the device to the system.

Device Identity and Matching

The first field is an arbitrary text name of the device. It is not interpreted in any way, and is used only for kernel messages. The next field is a substructure that describes how to find cards based on their PCI configuration space ID, subsystem ID and revision. Each logical element contains two values: the first is the value to match, and the second is a bitmask. The value read from PCI configuration space is ANDed with the bitmask before being compared to the match value. Unspecified elements in C structures default to '0', and a zero mask with a zero match value matches anything, so drivers that don't care about subsystem IDs and chip revision information only need to specify the vendor/device ID and a bitmask of 0xffffffff.

The search is done in table order, so more specific entries should be placed before "catch-all" entries. This allows us to support unknown or future chip types using a generic interface. It also allows us to identify boards by name while falling back to the generic name for the unknown subsystem IDs. This feature is technically pointless, but reduces the incentive for hardware vendors to distribute trivially hacked drivers.
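The masked, first-match search described above can be sketched in user-space C as follows. This is a minimal illustration, not the pci-scan implementation: the structure layout, the field names and the find_entry() helper are assumptions made for the example, and only the table values come from the epic100 example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for the real table structures. */
struct id_match {
    uint32_t pci, pci_mask;              /* combined device/vendor ID */
    uint32_t subsystem, subsystem_mask;
    uint32_t revision, revision_mask;
};

struct id_entry {
    const char *name;                    /* NULL name terminates the table */
    struct id_match id;
};

/* Return the index of the first matching entry, or -1 if none matches.
   A zero mask paired with a zero match value accepts any ID. */
static int find_entry(const struct id_entry *tbl, uint32_t pci_id,
                      uint32_t subsys_id, uint32_t rev)
{
    for (int i = 0; tbl[i].name != NULL; i++) {
        const struct id_match *m = &tbl[i].id;
        if ((pci_id & m->pci_mask) == m->pci &&
            (subsys_id & m->subsystem_mask) == m->subsystem &&
            (rev & m->revision_mask) == m->revision)
            return i;
    }
    return -1;
}

/* The revision-specific entry comes first, the catch-all entry second. */
static const struct id_entry demo_tbl[] = {
    {"SMSC EPIC/C 83c172",   {0x000510B8, 0x7fffffff, 0, 0, 7, 0xff}},
    {"SMSC EPIC/100 83c170", {0x000510B8, 0x7fffffff, 0, 0, 0, 0}},
    {"SMSC EPIC/C 83c175",   {0x000610B8, 0x7fffffff, 0, 0, 0, 0}},
    {NULL, {0, 0, 0, 0, 0, 0}},
};
```

With this table, a device with ID 0x000510B8 and revision 7 selects the first entry, while any other revision of the same ID falls through to the second. Swapping the first two entries would make the revision-specific entry unreachable.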

Note: PCI IDs are frequently documented as separate 16 bit vendor and device ID halves. The code above expects them as a single 32 bit value.
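When a datasheet lists the two halves separately, the combined value can be formed as in this small sketch. The helper name pci_combined_id() is hypothetical. The layout (vendor ID in the low 16 bits, device ID in the high 16 bits) is that of the first 32-bit word of PCI configuration space, and agrees with the 0x000510B8 entry in the table above.

```c
#include <stdint.h>

/* Device ID in the upper 16 bits, vendor ID in the lower 16 bits. */
static uint32_t pci_combined_id(uint16_t vendor, uint16_t device)
{
    return ((uint32_t)device << 16) | vendor;
}
```

For example, vendor 0x10B8 with device 0x0005 yields 0x000510B8.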

Device mapping

The next three values are used for checking and mapping the device's address space usage. The first value describes the usage of the PCI base address register and features. The example above is a typical setting: PCI_USES_MASTER | PCI_USES_MEM | PCI_ADDR1. This has the scan code enable the PCI bus master and memory mapping capabilities, and pass in PCI base address register 1. A simpler device, such as an NE2000, that supports only PCI target accesses in I/O space would use PCI_USES_IO | PCI_ADDR0. A 64 bit address register is referenced by the lower register in the pair and the flag PCI_ADDR_64BITS. The other settings are described in pci-scan.h.
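As a rough sketch of how such a flags word might be decoded, the following turns it into PCI command register bits and the configuration-space offset of the selected base address register. The numeric flag values and the "register number in bits 4 and up" encoding are assumptions for illustration (pci-scan.h holds the real definitions), and both helper functions are hypothetical. The command register bits and the 0x10 offset of the first base address register are standard PCI.

```c
#include <stdint.h>

/* Assumed flag encoding for this sketch; consult pci-scan.h for real values. */
enum {
    PCI_USES_IO = 1, PCI_USES_MEM = 2, PCI_USES_MASTER = 4,
    PCI_ADDR0 = 0 << 4, PCI_ADDR1 = 1 << 4, PCI_ADDR2 = 2 << 4, PCI_ADDR3 = 3 << 4,
};

/* PCI command register bits, as defined by the PCI specification. */
#define CMD_IO      0x1
#define CMD_MEMORY  0x2
#define CMD_MASTER  0x4

/* Command register bits that must be enabled for the given flags. */
static uint16_t command_bits(int flags)
{
    uint16_t cmd = 0;
    if (flags & PCI_USES_IO)
        cmd |= CMD_IO;
    if (flags & PCI_USES_MEM)
        cmd |= CMD_MEMORY;
    if (flags & PCI_USES_MASTER)
        cmd |= CMD_MASTER;
    return cmd;
}

/* Configuration-space offset of the selected base address register.
   Base address register 0 lives at offset 0x10; each register is 4 bytes. */
static int bar_config_offset(int flags)
{
    int bar = (flags >> 4) & 7;
    return 0x10 + 4 * bar;
}
```

Under these assumptions the EPIC setting enables memory and bus-master access and reads the register at offset 0x14, while the NE2000-style setting enables only I/O access and reads offset 0x10.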

The next value is the extent of the I/O or memory region.

Finally, since drivers usually support several chip types, the table contains a field reserved for driver use named 'drv_flags'. This is typically treated as a capabilities bitmap that condenses the features and differences between chip generations. In some cases it's used as an index value into a second table local to the driver. In either case the driver should be written so that adding table entries doesn't require changing the remainder of the driver source code.
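A short sketch of the capabilities-bitmap style follows. The bit values, the chip_drv_flags array and the helper function are all hypothetical; only the flag names are taken from the example table. The point is that driver code tests a capability bit instead of comparing against specific chip indices, so a new table entry needs no other source change.

```c
/* Hypothetical capability bits, named after the flags in the example table. */
enum chip_capabilities { TYPE2_INTR = 1, HAS_ACPI = 2, MII_PWRDWN = 4 };

/* Stand-in for the drv_flags column of the ID table, in table order. */
static const int chip_drv_flags[] = {
    HAS_ACPI,               /* 83c172 */
    TYPE2_INTR,             /* 83c170 */
    HAS_ACPI | MII_PWRDWN,  /* 83c175 */
};

/* Test a capability bit rather than checking for particular chip_id values. */
static int chip_can_power_down_mii(int chip_id)
{
    return (chip_drv_flags[chip_id] & MII_PWRDWN) != 0;
}
```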

The `drv_id_info` structure

A second structure is used to pass the driver nickname, flags that describe features of the driver, the device class, the probe1/attach routine and an optional power state control routine.

The nickname is a short (<= NAME NOT THAT CONSIST OF MAY AND IT ENTIRELY

The driver flags field is a list of driver features. A typical flag is PCI_HOTSWAP, which indicates that the driver supports features needed for CardBus-like dynamic attach and detach semantics. This flag directs the driver to retain the struct pci_id_info table, so the table may not be marked as __initdata.

The driver attach function

When a matching device is found, the attach routine, usually called probe1(), is called. This returns either a pointer to the device or 0, indicating failure. The returned device pointer is used for subsequent power control calls, and is uninterpreted except in one specific case: for CardBus network adapters it must be a struct net_device * with a valid dev->name field.

The first parameter to probe1() is the PCI location, a struct pci_dev *pdev. The next parameter is a void * to an initial device. For network devices this is a struct net_device *, which is typically NULL, but may be non-NULL for built-in drivers.

The PCIADDR and IRQ parameters are next. Memory regions are mapped into the kernel address space, and the virtual address passed to probe1(). I/O regions are checked using check_region(), but not registered (the device name e.g. "eth0" is not known). If either the mapping or the I/O space check fails the device attach function will not be called.

Note: A future extension may check for a valid IRQ as well. PCI BIOSes use both IRQ255 and IRQ0 to indicate an invalid setting.

Finally, the probe1() routine is passed the index into the pci_id_info table and a count of the devices found before this device, or zero if that count isn't known or doesn't make sense. Using '0' when the device count changes or isn't known allows passing driver parameters to CardBus devices in an easy-to-document way.
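One way the found-device count might be used is to select per-card driver parameters, sketched below. The options array, MAX_UNITS and option_for_card() are assumptions for illustration, not part of the pci-scan interface. Because a hot-plugged CardBus device is passed a count of zero, it always picks up the first parameter slot, which is simple to document.

```c
#define MAX_UNITS 8

/* Hypothetical per-card parameters, e.g. set at module load time.
   A value of -1 means "no option given". */
static int options[MAX_UNITS] = {-1, -1, -1, -1, -1, -1, -1, -1};

/* Select the parameter for the card_idx value passed to probe1(). */
static int option_for_card(const int *opts, int card_idx)
{
    if (card_idx < 0 || card_idx >= MAX_UNITS)
        return -1;              /* more cards than parameter slots */
    return opts[card_idx];
}
```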

The Power Event function

If an optional power event function is provided it may be called with various power control commands. The first action is usually DRV_ATTACH, which indicates that the driver should expect a DRV_DETACH final action. The DRV_SUSPEND and DRV_RESUME actions match the existing CardBus semantics, with a suspend always followed by a resume or detach. The DRV_PWR_WakeOn action is a variation of suspend where, if stand-by power is available, the device should enable any wake-on features it has. The DRV_PWR_DOWN and DRV_PWR_UP actions change the power state to minimal and standard power levels, respectively.

Implementation notes: Attach and detach are used for maintaining the module use count and freeing resources from probe1(). Detach is legal from the suspend and WakeOn state without a matching resume. WakeOn, Up and Down power level may be set without using attach first.
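The shape of a power event function following these rules might look like the sketch below. The numeric values of the action codes, the demo_dev structure and its fields are assumptions for illustration. The function signature matches the epic_pwr_event() prototype shown earlier.

```c
/* Action codes named in the text; the numeric values here are assumed. */
enum drv_pwr_action {
    DRV_ATTACH = 1, DRV_SUSPEND, DRV_RESUME, DRV_DETACH,
    DRV_PWR_WakeOn, DRV_PWR_DOWN, DRV_PWR_UP,
};

/* Hypothetical per-device state kept by a driver. */
struct demo_dev {
    int use_count;      /* stands in for the module use count */
    int suspended;
    int wake_enabled;
    int low_power;
};

static int demo_pwr_event(void *dev_instance, int event)
{
    struct demo_dev *dev = dev_instance;

    switch (event) {
    case DRV_ATTACH:
        dev->use_count++;
        break;
    case DRV_SUSPEND:
        dev->suspended = 1;
        break;
    case DRV_RESUME:
        dev->suspended = 0;
        dev->wake_enabled = 0;
        break;
    case DRV_DETACH:
        /* Legal from the suspend or WakeOn state without a resume. */
        dev->use_count--;
        dev->suspended = 0;
        dev->wake_enabled = 0;
        break;
    case DRV_PWR_WakeOn:
        dev->suspended = 1;
        dev->wake_enabled = 1;
        break;
    case DRV_PWR_DOWN:
        dev->low_power = 1;
        break;
    case DRV_PWR_UP:
        dev->low_power = 0;
        break;
    default:
        return -1;      /* unknown action */
    }
    return 0;
}
```

Routing every action through one switch means a future action only adds a case, rather than a new function pointer in the driver structure.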

ACPI Support

Most modern PCI devices have power management support. While Linux doesn't yet have an ACPI infrastructure, individual drivers support ACPI features. Basic drivers may ignore ACPI support; however, activating a device from the D3 full-suspended state to the D0 active state is a common requirement, e.g. when another OS has left the device in the D3 state. Most devices require saving and restoring all PCI configuration space registers when transitioning from the D3 state, since they do an internal power-on reset.

We provide a utility (non-abstraction) function to minimize the duplicated driver code:

    int acpi_set_pwr_state(int bus, int devfn, enum acpi_pwr_state new_state)

This routine sets the device's power state, correctly handling the wake-up (D3->D*) transition.

This is implemented using two exported functions that may be generally useful.

    int pci_find_capability(int bus, int devfn, int findtype)

Used to find the offset of the extended ACPI capability structure, usually PCI_CAP_ID_PM.

    int acpi_wake(int bus, int devfn)

Used to set the device to D0 state.
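To make the mechanics concrete, here is a user-space sketch of a capability-list walk and a D3 to D0 wake, operating on a plain byte array that stands in for a device's 256-byte configuration space. The functions find_capability() and wake_device() are hypothetical and take the array instead of a bus/devfn pair, so they are not the exported routines named above. The register offsets and bit definitions are standard PCI and PCI power management values.

```c
#include <stdint.h>
#include <string.h>

#define PCI_STATUS           0x06  /* status register (low byte) */
#define PCI_STATUS_CAP_LIST  0x10  /* capability list present */
#define PCI_CAPABILITY_LIST  0x34  /* pointer to first capability */
#define PCI_CAP_ID_PM        0x01  /* power management capability ID */
#define PCI_PM_CTRL          4     /* control/status offset in the capability */
#define PCI_PM_STATE_MASK    0x03  /* D-state field; 0 is D0, 3 is D3 */

/* Return the offset of the capability with ID findtype, or 0 if absent. */
static int find_capability(const uint8_t cfg[256], int findtype)
{
    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return 0;
    int pos = cfg[PCI_CAPABILITY_LIST] & 0xFC;
    /* The iteration limit guards against a corrupt, looping list. */
    for (int ttl = 48; ttl > 0 && pos >= 0x40; ttl--) {
        if (cfg[pos] == findtype)
            return pos;
        pos = cfg[pos + 1] & 0xFC;
    }
    return 0;
}

/* Move the device to D0. Returns 1 if a transition was made, else 0. */
static int wake_device(uint8_t cfg[256])
{
    uint8_t saved[64];
    int pm = find_capability(cfg, PCI_CAP_ID_PM);

    if (pm == 0 || pm > 256 - 8)
        return 0;               /* no usable power management capability */
    if ((cfg[pm + PCI_PM_CTRL] & PCI_PM_STATE_MASK) == 0)
        return 0;               /* already in D0 */

    memcpy(saved, cfg, sizeof(saved));   /* save the standard header */
    cfg[pm + PCI_PM_CTRL] &= (uint8_t)~PCI_PM_STATE_MASK;
    /* Real hardware may reset itself here and lose its settings; this
       array does not, so the restore below only shows the sequence. */
    memcpy(cfg, saved, sizeof(saved));
    return 1;
}
```

A real implementation would use configuration-space read and write calls and typically a short delay after the state change; those details are omitted here.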

No other ACPI functions are provided. Other code that relates to power management is either trivial (e.g. setting Wake-On-LAN), or complex in a very device and driver specific way.

Design notes:
Open Implementation Questions, Known Issues and Limitations

We support matching by chip revision. A design goal has been to make the table entries light-weight to encourage their use. Using revision information adds 8 bytes to each table entry, and thus runs counter to that goal. But it does help localize chip capability information, as it's common to change the revision number when new features or bug fixes are added. Some chips (e.g. the Digital 21142/21143) even change part number with just a revision number change.

We support matching by subsystem ID. Most board vendors just put the chip on a board with a generic design, so subsystem IDs are rarely useful from a technical viewpoint. But experience has shown that board vendors will do a version split for the sole purpose of having their name show up in the device recognition message. A case where the subsystem ID differentiation is required is boards using the PLX PCI interface chips as a bridge to non-PCI chips. All such boards show up as PLX devices, even though they are unrelated devices.

Matching is done using a combined 32 bit vendor and device ID. This value is often documented as two 16 bit half-values. In part this difference is intentional, to encourage people not to use the constants defined in linux/pci.h. Those defines are portability problems, and are sometimes misleading. Several vendors have multiple Vendor IDs, and linux/pci.h has numerous examples of just-plain-wrong device names. There is no reason to use a symbolic name in place of an explicit numeric value for an assigned, permanent identity constant.

A future extension may check for a valid IRQ, with a flag for devices that do not require a valid IRQ. PCI BIOSes use both IRQ255 and IRQ0 so the code will select a single value named PCI_INVALID_IRQ. Note: IRQ0 was originally documented as valid, with IRQ255 being the proper value for 'unassigned'. But so many x86 BIOSes incorrectly used IRQ0 instead of IRQ255 that they now both mean 'unassigned'. Recent kernels map both values to IRQ0, but support for older kernels must handle this explicitly.
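A sketch of the explicit handling an older-kernel driver would need follows. The value chosen for PCI_INVALID_IRQ and the normalize_irq() helper are assumptions; the text only says that a single value with that name will be selected.

```c
/* Assumed value: fold both "unassigned" encodings onto IRQ0. */
#define PCI_INVALID_IRQ 0

/* Map the two BIOS conventions for "unassigned" to one value. */
static int normalize_irq(int irq)
{
    return (irq == 0 || irq == 255) ? PCI_INVALID_IRQ : irq;
}
```

A driver could then compare the result against PCI_INVALID_IRQ once, instead of testing for both 0 and 255 at each use.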

You might expect that the design would have attach(), suspend(), resume() and detach() functions as the current CardBus code does. However, there are other similar actions, such as "go to low power mode", that we might reasonably add in the future, and attach/probe1 has a complex, unique calling sequence, unlike the other actions. Instead these functions are part of the do-everything pwr_event() entry point.

Note that the PCI_COMMAND_MASTER bit is set before probe1() is called. The code currently sets this enable bit at the same time I/O or memory space access is enabled. This allows self-test code in probe1() to include a bus-master test. Some datasheets suggest the master bit should not be enabled until after the chip is reset, in case an old transfer is in progress. But the BIOS should have reset the chip at the same time it disabled master capability.

The current code rewrites the PCI latency register for BIOSes that leave it at zero or a very low value. This is a questionable practice, although some devices require it for proper operation. The PCI_NO_MIN_LATENCY flag disables this. The usual need is for a setting of at least 10, with a higher requirement uncommon. An exception is the 3Com 3c590 series of adapters, which require the maximum possible setting of 248 to avoid a design problem.
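The latency policy amounts to a clamp that the flag can switch off, as in this sketch. The flag's numeric value and the adjust_latency() helper are assumptions for illustration; the minimum is passed in as a parameter because the text gives only typical figures.

```c
#include <stdint.h>

enum { PCI_NO_MIN_LATENCY = 0x400 };    /* value assumed for this sketch */

/* Return the latency timer value to write back: the BIOS value is kept
   unless it is below the minimum and the flag does not forbid changes. */
static uint8_t adjust_latency(uint8_t current, uint8_t minimum, int flags)
{
    if (flags & PCI_NO_MIN_LATENCY)
        return current;
    return current < minimum ? minimum : current;
}
```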

The code activates the device to ACPI D0 state before the probe1() code is called, unless the PCI_NO_ACPI_WAKE flag is set. Not all devices need to be awakened when scanned, but most (especially Ethernet adapters) do. Many newer MS-Windows drivers leave the hardware in D3 state, which commonly persists through a warm boot. Activating the device permits drivers to ignore ACPI, which is especially useful since ACPI is a feature that is silently added to the hardware after the driver is written.

Author: Donald Becker
See the drivers for the contact email address. Do not bother sending email to zinc.anode@scyld.com, as email to that address adds your domain or IP address to the known-spammer list.








© 2000-2002 Scyld Computing Corporation