Policy¶
- class parallelformers.policies.base.policy.Layer(weight: Optional[str] = None, bias: Optional[str] = None, n_fused: Optional[int] = None, replace: Optional[Any] = None, reversed: Optional[Any] = None, ignore_checker: bool = False)[source]¶
Bases:
object
Dataclass used to describe a layer in the policy object
- weight and bias
the names of the weight and bias tensors, respectively. Tensor names support the . and [ ] syntax: . accesses an attribute, as in common programming languages, and [ ] indexes an element of an nn.ModuleList.
- Type
str
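As an illustration of the name syntax, here is a minimal sketch (not part of parallelformers) that resolves a dotted name such as h[0].attn.c_attn.weight against a nested module tree; the module and attribute names are hypothetical, and plain Python objects stand in for nn.Module and nn.ModuleList.

```python
import re
from functools import reduce
from types import SimpleNamespace

def resolve(module, name):
    """Resolve a tensor name like 'h[0].attn.c_attn.weight'.
    '.' descends into an attribute; '[i]' indexes an element,
    as in an nn.ModuleList (modeled here with a plain list)."""
    parts = []
    for token in name.split("."):
        match = re.fullmatch(r"(\w+)\[(\d+)\]", token)
        if match:
            parts += [match.group(1), int(match.group(2))]
        else:
            parts.append(token)
    return reduce(
        lambda obj, p: obj[p] if isinstance(p, int) else getattr(obj, p),
        parts,
        module,
    )

# A toy module tree standing in for a model with an nn.ModuleList 'h'.
model = SimpleNamespace(
    h=[SimpleNamespace(attn=SimpleNamespace(c_attn=SimpleNamespace(weight="W0")))]
)
print(resolve(model, "h[0].attn.c_attn.weight"))  # prints W0
```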
- n_fused¶
the number of areas in a fused layer. For example, GPT2 and TransfoXL have fused attention layers that combine the query, key and value projections. Such layers should not simply be chunked by the number of GPUs; they must first be divided into the query, key and value areas.
- Type
int
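To make the area-first splitting concrete, here is a minimal sketch (plain Python lists standing in for tensors; not the library's actual slicing code): the fused rows are first divided into n_fused areas, and each area is then chunked across the GPUs.

```python
def slice_fused(rows, n_fused, world_size, rank):
    """Slice a fused q/k/v tensor (modeled as a list of rows) for one rank.
    Chunking by world_size directly would give one rank only query rows and
    another only value rows; splitting each area separately keeps a piece
    of query, key and value on every GPU."""
    area = len(rows) // n_fused
    chunk = area // world_size
    out = []
    for i in range(n_fused):
        start = i * area + rank * chunk
        out.extend(rows[start:start + chunk])
    return out

rows = list(range(12))  # 3 fused areas (q, k, v) of 4 rows each
print(slice_fused(rows, n_fused=3, world_size=2, rank=0))  # [0, 1, 4, 5, 8, 9]
print(slice_fused(rows, n_fused=3, world_size=2, rank=1))  # [2, 3, 6, 7, 10, 11]
```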
- replace¶
the layer that you want to replace an existing layer with. The tensor slicing method of parallelization requires All-Reduce operations to collect partial results from all GPUs, so a layer such as AllReduceLinear must be inserted in place of the existing nn.Linear layer.
- Type
Any
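The need for All-Reduce can be demonstrated with a toy sliced matrix-vector product (plain Python, no real GPUs or parallelformers code): each simulated rank multiplies its slice of the input columns, and summing the partial outputs, which is exactly what an all-reduce (sum) does, recovers the full result.

```python
def matvec(w, x):
    """Plain matrix-vector product: one output value per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

w = [[1, 2, 3, 4], [5, 6, 7, 8]]   # full weight: 2 outputs x 4 inputs
x = [1, 1, 1, 1]

# Split the input dimension across two simulated GPUs.
shards = ([row[:2] for row in w], [row[2:] for row in w])
inputs = (x[:2], x[2:])
partials = [matvec(s, xi) for s, xi in zip(shards, inputs)]

# All-reduce (sum) across ranks: every rank ends up with the full output.
reduced = [sum(p) for p in zip(*partials)]
print(reduced == matvec(w, x))  # True
```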
- reversed¶
this attribute indicates whether the tensors are stored reversed. Some models such as GPT1 and GPT2 use the transformers.modeling_utils.Conv1D layer instead of the nn.Linear layer; these layers store their weight and bias tensors reversed.
- Type
Any
- ignore_checker¶
this attribute is used when you want to ignore errors in case a layer does not exist. Some models like Bert, Roberta and Electra have only encoder layers, but Huggingface also designs these models so that they can be used as decoders. In such models, some layers may or may not be created depending on the configuration. The ignore_checker option lets you ignore errors even if the layers do not always exist.
- Type
bool
- replace: Any = None¶
- reversed: Any = None¶
- class parallelformers.policies.base.policy.Policy(layer: torch.nn.modules.module.Module)[source]¶
Bases:
abc.ABC
Policy object to apply parallelism per model
- static replace_arguments(config, world_size: int) Dict [source]¶
Policy for argument replacement.
- Parameters
config (Config) – Huggingface config object
world_size (int) – total number of GPUs used for parallelization
- Returns
Dictionary for argument replacement.
- Return type
Dict
Notes
The format of the dictionary object is as follows.
- dict:
“param_name_1”: reset_value_1, “param_name_2”: reset_value_2, “param_name_3”: reset_value_3, … “param_name_n”: reset_value_n
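For instance, a replace_arguments implementation for a GPT-2-like model might look as follows; the attribute paths and config field names are illustrative assumptions, not the library's exact policy, and SimpleNamespace stands in for a Huggingface config object.

```python
from types import SimpleNamespace
from typing import Dict

class MyPolicy:
    @staticmethod
    def replace_arguments(config, world_size: int) -> Dict:
        # Per-GPU attention sizes after tensor slicing (hypothetical paths).
        return {
            "attn.num_heads": config.n_head // world_size,
            "attn.split_size": config.n_embd // world_size,
        }

config = SimpleNamespace(n_head=12, n_embd=768)
print(MyPolicy.replace_arguments(config, world_size=4))
# {'attn.num_heads': 3, 'attn.split_size': 192}
```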
- static replace_modules() Dict [source]¶
Policy for class (module) replacement.
- Returns
Dictionary for class (module) replacement.
- Return type
Dict
Notes
The format of the dictionary object is as follows.
- dict:
“orig_class_name_1”: reset_module_class_1, “orig_class_name_2”: reset_module_class_2, “orig_class_name_3”: reset_module_class_3, … “orig_class_name_n”: reset_module_class_n
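A sketch of a replace_modules implementation, with dummy stand-in classes; in a real policy the mapping would point an existing Huggingface module class name at its parallelized replacement.

```python
from typing import Dict

class OriginalMLP:
    """Stand-in for an existing Huggingface module class."""

class ParallelMLP(OriginalMLP):
    """Stand-in for the parallelized replacement module."""

class MyPolicy:
    @staticmethod
    def replace_modules() -> Dict:
        # Map the original class name to the replacement class.
        return {"OriginalMLP": ParallelMLP}

print(MyPolicy.replace_modules())
```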
- static attn_qkv() List [source]¶
Attention query, key, value projection layer
- Returns
List of Layer objects
- Return type
List[Layer]
- static attn_out() List [source]¶
Attention output projection layer
- Returns
List of Layer objects
- Return type
List[Layer]
- static mlp_in() List [source]¶
h -> 4h MLP layer
- Returns
List of Layer objects
- Return type
List[Layer]
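Putting the pieces together, the following sketch shows what an attn_qkv implementation could return for a GPT-2-like model. The Layer dataclass is re-declared here as a minimal stand-in with the fields documented above, so the example runs without parallelformers installed; the tensor names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

# Minimal stand-in mirroring the documented Layer dataclass fields.
@dataclass
class Layer:
    weight: Optional[str] = None
    bias: Optional[str] = None
    n_fused: Optional[int] = None
    replace: Any = None
    reversed: Any = None
    ignore_checker: bool = False

def attn_qkv() -> List[Layer]:
    # Hypothetical fused q/k/v projection of a GPT-2-like model:
    # one Conv1D layer (stored reversed) covering 3 areas (q, k, v).
    return [
        Layer(
            weight="attn.c_attn.weight",
            bias="attn.c_attn.bias",
            n_fused=3,
            reversed=True,
        )
    ]

print(attn_qkv()[0])
```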