
Anyone dare to guess how fast GH Copilot would be if it ran locally?

My main problem with Copilot is latency/speed - I would shell out for a 4090 if it meant I could run a local Copilot model that's super fast, low-latency, and explores deeper suggestions.



One of the weaknesses of transformers is that inference is inherently serial, one token at a time, and the entire model's weights need to be read from memory once per generated token. This means that inference (after prompt processing) is bounded by memory bandwidth rather than compute.
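
As a rough back-of-the-envelope sketch (the numbers are illustrative assumptions: an RTX 4090 with ~1 TB/s of memory bandwidth and a 7B model quantized down to ~4 GB), the bandwidth bound gives you a ceiling on decode speed:

```python
# Rough upper bound on decode speed: every generated token requires
# streaming all model weights through memory once, so
#   tokens/sec <= memory_bandwidth / model_size_in_bytes
# (ignores KV-cache reads, kernel overhead, and batching)

MEM_BANDWIDTH_GBPS = 1008   # RTX 4090 spec, ~1 TB/s (assumption)
MODEL_SIZE_GB = 4.0         # e.g. a 7B model at ~4-bit quantization (assumption)

max_tokens_per_sec = MEM_BANDWIDTH_GBPS / MODEL_SIZE_GB
print(f"Theoretical ceiling: ~{max_tokens_per_sec:.0f} tokens/sec")
# -> roughly 250 tokens/sec; real decode is lower, but it shows why a
#    local model on a 4090 can feel fast once the prompt is processed.
```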

That said, local solutions tend to have lower latency than cloud ones simply because you skip the network round trip.



