
Expert Parallelism #4

@sirish-gambhira


Hi team,

Thank you for sharing your code.

Could you please elaborate on how you are performing Expert Parallelism? It is described as: "We shard MLP experts across all devices to fit Hessians into VRAM, required for GPTQ calibration. Each process stores only a fraction of expert layers and corresponding Hessians."

In the current codebase, the number of block keys in the main process is equal to the number of keys on the other GPUs (see https://github.com/IST-DASLab/MoE-Quant/blob/master/quant.py#L181 and https://github.com/IST-DASLab/MoE-Quant/blob/master/quant.py#L187). Thus, each rank appears to store an entire copy of the state dict, in contrast to the idea of expert parallelism, where the MLP experts are sharded across ranks.

Ask: could you please share a way to confirm that the experts are sharded across the ranks? (One possible diagnostic is sketched after the log output below.)

For an 8-GPU setup, here is the output for block_idx = 3 (starting from 0):

    Main process num states: 110
    Device: 1 num states: 110
    Device: 2 num states: 110
    Device: 3 num states: 110
    Device: 4 num states: 110
    Device: 5 num states: 110
    Device: 6 num states: 110
    Device: 7 num states: 110
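For reference, here is a minimal sketch of one way to check this directly, assuming each rank holds a per-block dict of layer states (the names `check_expert_sharding` and `block_state` are placeholders, not identifiers from the MoE-Quant codebase). It gathers every rank's key list with `torch.distributed.all_gather_object` and reports on rank 0 whether the key sets are identical (replicated) or different (sharded):

```python
import torch.distributed as dist

def check_expert_sharding(block_state: dict, rank: int, world_size: int) -> None:
    """Compare per-rank key sets of a block's state dict.

    block_state is a placeholder for whatever dict of layer names ->
    tensors/Hessians each rank holds around quant.py#L181/#L187.
    Requires an initialized process group (e.g. dist.init_process_group).
    """
    local_keys = sorted(block_state.keys())
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_keys)  # collect every rank's key list

    if rank == 0:
        key_sets = [set(keys) for keys in gathered]
        union = set().union(*key_sets)
        print(f"total unique keys across ranks: {len(union)}")
        for r, keys in enumerate(key_sets):
            print(f"rank {r}: {len(keys)} keys")
        if all(keys == union for keys in key_sets):
            print("every rank holds the full key set -> state is replicated")
        else:
            print("ranks hold different key subsets -> experts are sharded")
```

Comparing the actual key sets (rather than only their counts, as in the output above) would also distinguish true replication from sharding into equally sized shards.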

Tagging @Godofnothing @eldarkurtic for visibility. Thank you.
