The VanillaVAE model's final layer is nn.Tanh(), so the reconstructed image it outputs has pixel values in the range [-1, 1]. However, transforms.ToTensor() converts input images to the range [0, 1]. The reconstruction loss therefore compares outputs and targets that live on different value ranges, so the loss is incorrect.
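
One possible way to line the two ranges up is to rescale the inputs from [0, 1] to [-1, 1] before computing the loss. The snippet below is only a minimal sketch of that idea, not the repository's actual training code: the helper to_tanh_range is a hypothetical name, and the random tensors merely stand in for a ToTensor() batch and a Tanh-terminated decoder output.

```python
import torch
import torch.nn.functional as F


def to_tanh_range(x: torch.Tensor) -> torch.Tensor:
    """Map values from [0, 1] to [-1, 1] (hypothetical helper, not from the repo)."""
    return x * 2.0 - 1.0


torch.manual_seed(0)

# Stand-in for a batch produced by transforms.ToTensor(): values in [0, 1].
inputs = torch.rand(4, 3, 8, 8)

# Stand-in for a decoder output whose last layer is nn.Tanh(): values in [-1, 1].
recons = torch.tanh(torch.randn(4, 3, 8, 8))

# Loss as described above: the target is still in [0, 1].
mismatched_loss = F.mse_loss(recons, inputs)

# Loss after rescaling the target into the same range as the output.
matched_loss = F.mse_loss(recons, to_tanh_range(inputs))

print(f"inputs range:  [{inputs.min().item():.3f}, {inputs.max().item():.3f}]")
print(f"recons range:  [{recons.min().item():.3f}, {recons.max().item():.3f}]")
print(f"mismatched loss: {mismatched_loss.item():.4f}")
print(f"matched loss:    {matched_loss.item():.4f}")
```

In a torchvision pipeline, the same rescaling is commonly done by adding transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) after transforms.ToTensor() for 3-channel images, since (x - 0.5) / 0.5 equals 2x - 1. The alternative is to leave the inputs in [0, 1] and replace the final nn.Tanh() with nn.Sigmoid(); which option fits better depends on the rest of the training setup.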