Policy

class parallelformers.policies.base.policy.Layer(weight: Optional[str] = None, bias: Optional[str] = None, n_fused: Optional[int] = None, replace: Optional[Any] = None, reversed: Optional[Any] = None, ignore_checker: bool = False)[source]

Bases: object

Dataclass used to describe a layer in the policy object

weight and bias

the names of the weight and bias tensors, respectively. You can use syntax such as [ ] or . in the tensor names: . acts as an attribute accessor, as in common programming languages, and [ ] accesses an element of an nn.ModuleList.

Type

str

n_fused

the number of fused areas in the layer. For example, GPT2 and TransfoXL have fused attention layers in which the query, key and value projections are combined into one tensor. Such layers cannot simply be chunked by the number of GPUs; they must first be divided into the query, key and value areas.

Type

int

replace

the layer class that you want to replace an existing layer with. Parallelization by tensor slicing involves All-Reduce operations to collect the results from all GPUs, so a layer such as AllReduceLinear needs to be inserted in place of the existing nn.Linear layer.

Type

Any

reversed

this attribute indicates whether the parameter tensors are stored reversed. Some models such as GPT1 and GPT2 use the transformers.modeling_utils.Conv1D layer instead of nn.Linear, and these layers store their weight and bias tensors in reversed order.

Type

bool

ignore_checker

this attribute is used when you want to ignore errors for layers that may not exist. Some models such as Bert, Roberta and Electra consist only of encoder layers, but in Hugging Face Transformers they are also designed to be usable as decoders. In these models, certain layers may or may not be created depending on the configuration. In that case, you can use the ignore_checker option to ignore errors even when the layers do not exist.

Type

bool

weight: Optional[str] = None
bias: Optional[str] = None
n_fused: Optional[int] = None
replace: Optional[Any] = None
reversed: Optional[Any] = None
ignore_checker: bool = False
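
As an illustration, the snippet below constructs a Layer by hand. It is a sketch modeled on GPT-2's fused c_attn projection, not a policy shipped with the library, and the attribute paths are assumptions about the underlying block:

    from parallelformers.policies.base.policy import Layer

    # Sketch modeled on GPT-2's fused attention projection.
    # "attn.c_attn.weight" uses . to walk attributes of the block;
    # an index such as "h[0]" would address an nn.ModuleList element.
    qkv = Layer(
        weight="attn.c_attn.weight",
        bias="attn.c_attn.bias",
        n_fused=3,      # c_attn fuses the query, key and value areas
        reversed=True,  # GPT-2 uses Conv1D, which stores tensors reversed
    )
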
class parallelformers.policies.base.policy.Policy(layer: torch.nn.modules.module.Module)[source]

Bases: abc.ABC

Policy object that describes how to apply parallelism to a specific model

static replace_arguments(config, world_size: int) → Dict[source]

Policy for argument replacement.

Parameters
  • config (Config) – Hugging Face config object

  • world_size (int) – total number of GPUs used for parallelization

Returns

Dictionary for argument replacement.

Return type

Dict

Notes

The format of the dictionary object is as follows:

dict:
    "param_name_1": reset_value_1,
    "param_name_2": reset_value_2,
    "param_name_3": reset_value_3,
    ...
    "param_name_n": reset_value_n
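
For instance, a policy for a GPT-2-like model could divide the attention module's dimensions across GPUs. This is a sketch; the attribute paths "attn.embed_dim" and "attn.num_heads" are assumptions about the model's internals:

    def replace_arguments(config, world_size: int) -> dict:
        # Sketch for a GPT-2-like model: after slicing, each GPU holds
        # hidden_size // world_size features and n_head // world_size heads.
        return {
            "attn.embed_dim": config.hidden_size // world_size,
            "attn.num_heads": config.n_head // world_size,
        }
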

static replace_modules() → Dict[source]

Policy for class (module) replacement.

Returns

Dictionary for class (module) replacement.

Return type

Dict

Notes

The format of the dictionary object is as follows:

dict:
    orig_class_name_1: reset_module_class_1,
    orig_class_name_2: reset_module_class_2,
    orig_class_name_3: reset_module_class_3,
    ...
    orig_class_name_n: reset_module_class_n
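
As a sketch, a policy might map a model's original attention class to a parallel-friendly variant. Both class names below are hypothetical placeholders, not real parallelformers or Transformers classes:

    import torch.nn as nn

    class MyParallelAttention(nn.Module):
        """Hypothetical replacement module."""

    def replace_modules() -> dict:
        # Replace every module whose class is named "SomeModelAttention"
        # with the hypothetical parallel variant above.
        return {"SomeModelAttention": MyParallelAttention}
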

static attn_qkv() → List[source]

Attention query, key, value projection layer

Returns

List of Layer objects

Return type

List[Layer]

static attn_out() → List[source]

Attention output projection layer

Returns

List of Layer objects

Return type

List[Layer]

static mlp_in() → List[source]

h -> 4h MLP layer (the first feed-forward projection)

Returns

List of Layer objects

Return type

List[Layer]

static mlp_out() → List[source]

4h -> h MLP layer (the second feed-forward projection)

Returns

List of Layer objects

Return type

List[Layer]

abstract static original_layer_class() → Type[torch.nn.modules.module.Module][source]

Class to apply the policy to, e.g. BertLayer, GPT2Block, BartEncoderLayer, …

Returns

original layer class

Return type

Type[nn.Module]
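
Putting it all together, a complete policy modeled on GPT-2 might look like the sketch below. The Layer paths follow GPT2Block's attribute names, but this is illustrative rather than the GPT2Policy shipped with the library; in a real policy, the output projections would also set replace to an All-Reduce linear layer as described above:

    from transformers.models.gpt2.modeling_gpt2 import GPT2Block

    from parallelformers.policies.base.policy import Layer, Policy

    class GPT2LikePolicy(Policy):
        """Sketch modeled on GPT-2; not the shipped GPT2Policy."""

        @staticmethod
        def replace_arguments(config, world_size):
            # Shrink per-GPU attention dimensions after slicing.
            return {
                "attn.embed_dim": config.hidden_size // world_size,
                "attn.num_heads": config.n_head // world_size,
            }

        @staticmethod
        def attn_qkv():
            # Fused q/k/v projection: divide into 3 areas before slicing.
            return [
                Layer(
                    weight="attn.c_attn.weight",
                    bias="attn.c_attn.bias",
                    n_fused=3,
                    reversed=True,  # GPT-2 uses Conv1D
                )
            ]

        @staticmethod
        def attn_out():
            # A real policy would also pass replace=AllReduceLinear here
            # so the output projection performs the All-Reduce.
            return [
                Layer(
                    weight="attn.c_proj.weight",
                    bias="attn.c_proj.bias",
                    reversed=True,
                )
            ]

        @staticmethod
        def mlp_in():
            return [
                Layer(
                    weight="mlp.c_fc.weight",
                    bias="mlp.c_fc.bias",
                    reversed=True,
                )
            ]

        @staticmethod
        def mlp_out():
            # Like attn_out, this would normally set replace=AllReduceLinear.
            return [
                Layer(
                    weight="mlp.c_proj.weight",
                    bias="mlp.c_proj.bias",
                    reversed=True,
                )
            ]

        @staticmethod
        def original_layer_class():
            return GPT2Block
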