
Specifically, people would like to know what the attention calculations add to this learning of the distribution.


Just speculating, but I think attention enables differentiation of the semantic concepts a word or sentence can express within a particular context. For any total set of training data you have a smaller number of semantic concepts (say you have 10,000 words; they might express 2,000 semantic concepts, and those concepts are defined by sentence structure and surrounding words, which is why they carry a particular meaning). Attention then lets the model differentiate those contexts at different levels (words, phrases, etc.). The fact that you can compute attention at runtime/inference also means the context can be generated from the prompt, which enables the flexibility of variable prompt/variable output, though you lose the precision of giving an exact prompt and getting an exact answer.
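To make that speculation concrete, here's a minimal numpy sketch (my own illustration, not anything from a real model) of scaled dot-product attention. The point is that the output vector for a given word embedding is a context-weighted mix of all tokens in the sequence, so the same word placed among different neighbors comes out differently:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: each token's output is a
    # weighted average of all value vectors, with weights computed
    # from that token's similarity to every token in the sequence.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ V

# Toy setup: the same 4-dim "word" embedding in two different contexts.
rng = np.random.default_rng(0)
word = rng.normal(size=4)
ctx_a = np.stack([word, rng.normal(size=4), rng.normal(size=4)])
ctx_b = np.stack([word, rng.normal(size=4), rng.normal(size=4)])

# Self-attention (Q = K = V) over each sequence; look at the word's output.
out_a = attention(ctx_a, ctx_a, ctx_a)[0]
out_b = attention(ctx_b, ctx_b, ctx_b)[0]
print(np.allclose(out_a, out_b))  # False: same word, different contexts
```

The attention weights are a function of the whole sequence, which is why this disambiguation can happen at inference time from whatever prompt is given, rather than being baked in during training.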


I'm not one to whine about downvotes, but I have to say it's a bad feeling when I can't even respond to the negative feedback because there is no accompanying comment. Did I misinterpret something? Did you? Who will ever know when there is no information. :L



