Policy¶
- class parallelformers.policies.base.policy.Layer(weight: Optional[str] = None, bias: Optional[str] = None, n_fused: Optional[int] = None, replace: Optional[Any] = None, reversed: Optional[Any] = None, ignore_checker: bool = False)[source]¶
Bases:
object
Dataclass used to describe a layer in the policy object
- weight and bias
the names of the weight and bias tensors, respectively. Tensor names support the . and [ ] syntax: . accesses an attribute, as in common programming languages, and [ ] indexes an element of an nn.ModuleList.
- Type
str
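As an illustration of the name syntax, here is a minimal sketch (not part of parallelformers) that resolves a dotted name such as h[0].attn.c_attn.weight against a nested module tree; the module and attribute names are hypothetical, and plain Python objects stand in for nn.Module and nn.ModuleList.

```python
import re
from functools import reduce
from types import SimpleNamespace

def resolve(module, name):
    """Resolve a tensor name like 'h[0].attn.c_attn.weight'.
    '.' descends into an attribute; '[i]' indexes an element,
    as in an nn.ModuleList (modeled here with a plain list)."""
    parts = []
    for token in name.split("."):
        match = re.fullmatch(r"(\w+)\[(\d+)\]", token)
        if match:
            parts += [match.group(1), int(match.group(2))]
        else:
            parts.append(token)
    return reduce(
        lambda obj, p: obj[p] if isinstance(p, int) else getattr(obj, p),
        parts,
        module,
    )

# A toy module tree standing in for a model with an nn.ModuleList 'h'.
model = SimpleNamespace(
    h=[SimpleNamespace(attn=SimpleNamespace(c_attn=SimpleNamespace(weight="W0")))]
)
print(resolve(model, "h[0].attn.c_attn.weight"))  # prints W0
```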
- n_fused¶
the number of areas in a fused layer. For example, GPT2 and TransfoXL have fused attention layers that combine the query, key and value projections. Such layers should not simply be chunked by the number of GPUs; they must first be divided into the query, key and value areas.
- Type
int
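To make the area-first splitting concrete, here is a minimal sketch (plain Python lists standing in for tensors; not the library's actual slicing code): the fused rows are first divided into n_fused areas, and each area is then chunked across the GPUs.

```python
def slice_fused(rows, n_fused, world_size, rank):
    """Slice a fused q/k/v tensor (modeled as a list of rows) for one rank.
    Chunking by world_size directly would give one rank only query rows and
    another only value rows; splitting each area separately keeps a piece
    of query, key and value on every GPU."""
    area = len(rows) // n_fused
    chunk = area // world_size
    out = []
    for i in range(n_fused):
        start = i * area + rank * chunk
        out.extend(rows[start:start + chunk])
    return out

rows = list(range(12))  # 3 fused areas (q, k, v) of 4 rows each
print(slice_fused(rows, n_fused=3, world_size=2, rank=0))  # [0, 1, 4, 5, 8, 9]
print(slice_fused(rows, n_fused=3, world_size=2, rank=1))  # [2, 3, 6, 7, 10, 11]
```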
- replace¶
the layer that you want to replace an existing layer with. The tensor slicing method of parallelization requires All-Reduce operations to collect partial results from all GPUs, so a layer such as AllReduceLinear must be inserted in place of the existing nn.Linear layer.
- Type
Any
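The need for All-Reduce can be demonstrated with a toy sliced matrix-vector product (plain Python, no real GPUs or parallelformers code): each simulated rank multiplies its slice of the input columns, and summing the partial outputs, which is exactly what an all-reduce (sum) does, recovers the full result.

```python
def matvec(w, x):
    """Plain matrix-vector product: one output value per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

w = [[1, 2, 3, 4], [5, 6, 7, 8]]   # full weight: 2 outputs x 4 inputs
x = [1, 1, 1, 1]

# Split the input dimension across two simulated GPUs.
shards = ([row[:2] for row in w], [row[2:] for row in w])
inputs = (x[:2], x[2:])
partials = [matvec(s, xi) for s, xi in zip(shards, inputs)]

# All-reduce (sum) across ranks: every rank ends up with the full output.
reduced = [sum(p) for p in zip(*partials)]
print(reduced == matvec(w, x))  # True
```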
- reversed¶
this attribute indicates whether the tensors are stored reversed. Some models such as GPT1 and GPT2 use the transformers.modeling_utils.Conv1D layer instead of the nn.Linear layer; these layers store their weight and bias tensors reversed.
- Type
Any
- ignore_checker¶
this attribute is used when you want to ignore errors in case a layer does not exist. Some models like Bert, Roberta and Electra have only encoder layers, but Huggingface also designs these models so that they can be used as decoders. In such models, some layers may or may not be created depending on the configuration. The ignore_checker option lets you ignore errors even if the layers do not always exist.
- Type
bool
- replace: Any = None¶
- reversed: Any = None¶
- class parallelformers.policies.base.policy.Policy(layer: torch.nn.modules.module.Module)[source]¶
Bases:
abc.ABC
Policy object to apply parallelism per model
- static replace_arguments(config, world_size: int) Dict [source]¶
Policy for argument replacement.
- Parameters
config (Config) – Huggingface config object
world_size (int) – total number of GPUs used for parallelization
- Returns
Dictionary for argument replacement.
- Return type
Dict
Notes
The format of the dictionary object is as follows.
- dict:
“param_name_1”: reset_value_1, “param_name_2”: reset_value_2, “param_name_3”: reset_value_3, … “param_name_n”: reset_value_n
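For instance, a replace_arguments implementation for a GPT-2-like model might look as follows; the attribute paths and config field names are illustrative assumptions, not the library's exact policy, and SimpleNamespace stands in for a Huggingface config object.

```python
from types import SimpleNamespace
from typing import Dict

class MyPolicy:
    @staticmethod
    def replace_arguments(config, world_size: int) -> Dict:
        # Per-GPU attention sizes after tensor slicing (hypothetical paths).
        return {
            "attn.num_heads": config.n_head // world_size,
            "attn.split_size": config.n_embd // world_size,
        }

config = SimpleNamespace(n_head=12, n_embd=768)
print(MyPolicy.replace_arguments(config, world_size=4))
# {'attn.num_heads': 3, 'attn.split_size': 192}
```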
- static replace_modules() Dict [source]¶
Policy for class (module) replacement.
- Returns
Dictionary for class (module) replacement.
- Return type
Dict
Notes
The format of the dictionary object is as follows.
- dict:
“orig_class_name_1”: reset_module_class_1, “orig_class_name_2”: reset_module_class_2, “orig_class_name_3”: reset_module_class_3, … “orig_class_name_n”: reset_module_class_n
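A sketch of a replace_modules implementation, with dummy stand-in classes; in a real policy the mapping would point an existing Huggingface module class name at its parallelized replacement.

```python
from typing import Dict

class OriginalMLP:
    """Stand-in for an existing Huggingface module class."""

class ParallelMLP(OriginalMLP):
    """Stand-in for the parallelized replacement module."""

class MyPolicy:
    @staticmethod
    def replace_modules() -> Dict:
        # Map the original class name to the replacement class.
        return {"OriginalMLP": ParallelMLP}

print(MyPolicy.replace_modules())
```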
- static attn_qkv() List [source]¶
Attention query, key, value projection layer
- Returns
List of Layer objects
- Return type
List[Layer]
- static attn_out() List [source]¶
Attention output projection layer
- Returns
List of Layer objects
- Return type
List[Layer]
- static mlp_in() List [source]¶
h -> 4h MLP layer
- Returns
List of Layer objects
- Return type
List[Layer]
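Putting the pieces together, the following sketch shows what an attn_qkv implementation could return for a GPT-2-like model. The Layer dataclass is re-declared here as a minimal stand-in with the fields documented above, so the example runs without parallelformers installed; the tensor names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

# Minimal stand-in mirroring the documented Layer dataclass fields.
@dataclass
class Layer:
    weight: Optional[str] = None
    bias: Optional[str] = None
    n_fused: Optional[int] = None
    replace: Any = None
    reversed: Any = None
    ignore_checker: bool = False

def attn_qkv() -> List[Layer]:
    # Hypothetical fused q/k/v projection of a GPT-2-like model:
    # one Conv1D layer (stored reversed) covering 3 areas (q, k, v).
    return [
        Layer(
            weight="attn.c_attn.weight",
            bias="attn.c_attn.bias",
            n_fused=3,
            reversed=True,
        )
    ]

print(attn_qkv()[0])
```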