Skip to content

Conversation

YingboMa
Copy link
Contributor

@YingboMa YingboMa commented Feb 7, 2019

map has an inline limit of 16. To make sure that the whole broadcast tree gets inlined properly, I added the _inlined_map function. I am not sure if it is a good idea, but worth trying.

This PR solves the issue which I have mentioned in
#30973

julia> @allocated foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
                    k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
                    k26, k27, k28, k29, k30, k31, k32, k33, k34)
0

As LLVM vectorization cut-off is 10 arrays, it doesn't make sense to try to make map inline for more than 16 args. I have removed _inlined_map, but preserved other @inlines to ensure a 9-array broadcast has no allocation.

`map` has an inline limit of 16. To make sure that the whole broadcast
tree gets inlined properly, I added the `_inlined_map` function. I am not
sure if it is a good idea, but worth trying.

This PR solves the issue which I have mentioned in
JuliaLang@2693778#issuecomment-461248258

```julia
julia> @allocated foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
                    k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
                    k26, k27, k28, k29, k30, k31, k32, k33, k34)
0
```
@YingboMa
Copy link
Contributor Author

YingboMa commented Feb 7, 2019

using Test
u, uprev, k1, k2, k3, k4, k5, k6, k7, k8 = (ones(1000) for i in 1:10)
function foo(u, uprev, k1, k2, k3, k4, k5, k6) # vectorized
    @. u = uprev + 0.1*(0.1*k1 + 0.2*k2 + 0.3*k3 + 0.4*k4 + 0.5*k5 + 0.6*k6)
    nothing
end
function goo(u, uprev, k1, k2, k3, k4, k5, k6, k7) # vectorized
    @. u = uprev + 0.1*(0.1*k1 + 0.2*k2 + 0.3*k3 + 0.4*k4 + 0.5*k5 + 0.6*k6 + 0.7*k7)
    nothing
end
function hoo(u, uprev, k1, k2, k3, k4, k5, k6, k7, k8) # not vectorized
    @. u = uprev + 0.1*(0.1*k1 + 0.2*k2 + 0.3*k3 + 0.4*k4 + 0.5*k5 + 0.6*k6 + 0.7*k7 + 0.8*k8)
    nothing
end
function ioo(u, uprev, k1, k2, k3, k4, k5, k6, k7, k8) # not vectorized
    @. u = uprev + k1 + k2 + k3 + k4 + k5 + k6 + k7 + k8
    nothing
end
@allocated foo(u, uprev, k1, k2, k3, k4, k5, k6)
@allocated goo(u, uprev, k1, k2, k3, k4, k5, k6, k7)
@allocated hoo(u, uprev, k1, k2, k3, k4, k5, k6, k7, k8)
@allocated ioo(u, uprev, k1, k2, k3, k4, k5, k6, k7, k8)
@test @allocated(foo(u, uprev, k1, k2, k3, k4, k5, k6)) == 0
@test @allocated(goo(u, uprev, k1, k2, k3, k4, k5, k6, k7)) == 0
@test @allocated(hoo(u, uprev, k1, k2, k3, k4, k5, k6, k7, k8)) == 0
@test @allocated(ioo(u, uprev, k1, k2, k3, k4, k5, k6, k7, k8)) == 0
@test occursin("vector.body", sprint(code_llvm, foo, NTuple{8, Vector{Float32}}))
@test occursin("vector.body", sprint(code_llvm, goo, NTuple{9, Vector{Float32}}))
@test occursin("vector.body", sprint(code_llvm, hoo, NTuple{10, Vector{Float32}}))
@test occursin("vector.body", sprint(code_llvm, ioo, NTuple{10, Vector{Float32}}))
julia> ioo(u, uprev, k1, k2, k3, k4, k5, k6, k7, k8)
remark: simdloop.jl:71:0: loop not vectorized: cannot prove it is safe to reorder memory operations
remark: simdloop.jl:71:0: loop not vectorized
remark: simdloop.jl:71:0: loop not vectorized: cannot identify array bounds
remark: simdloop.jl:71:0: loop not vectorized
remark: simdloop.jl:71:0: loop not vectorized: cannot identify array bounds
remark: simdloop.jl:71:0: loop not vectorized
remark: simdloop.jl:71:0: loop not vectorized: cannot identify array bounds
remark: simdloop.jl:71:0: loop not vectorized
remark: simdloop.jl:71:0: loop not vectorized: cannot identify array bounds
remark: simdloop.jl:71:0: loop not vectorized

Looks like the new vectorization cut off is 10 arrays.
CC: @mbauman

@YingboMa
Copy link
Contributor Author

YingboMa commented Feb 7, 2019

https://github.com/JuliaLang/julia/pull/30986/files#diff-fdaf2dae46d7b04f9c9d9e3940188cb9R802 wouldn't pass because Julia enables bounds check during testing.

@ararslan ararslan requested a review from mbauman February 7, 2019 05:30
@ararslan ararslan added the broadcast Applying a function over a collection label Feb 7, 2019
@mbauman
Copy link
Member

mbauman commented Feb 7, 2019

Nice work — thanks!

@mbauman mbauman merged commit a3c95a6 into JuliaLang:mb/hack28126 Feb 7, 2019
@YingboMa YingboMa deleted the mb/hack28126 branch February 7, 2019 07:10
mbauman pushed a commit that referenced this pull request Mar 14, 2019
* Inline the whole broadcast expression to avoid allocation

`map` has an inline limit of 16. To make sure that the whole broadcast
tree gets inlined properly, I added the `_inlined_map` function. I am not
sure if it is a good idea, but worth trying.

This PR solves the issue which I have mentioned in
2693778#issuecomment-461248258

```julia
julia> @allocated foo(tmp, uprev, k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13,
                    k14, k15, k16, k17, k18, k19, k20, k21, k22, k23, k24, k25,
                    k26, k27, k28, k29, k30, k31, k32, k33, k34)
0
```

* Fix CI failure

* Stricter test (vectorization & no allocation for a 9-array bc)

* rm `_inlined_map`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
broadcast Applying a function over a collection
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants