I also tried the demo, and I found it pretty much useless at most tasks, even compared to a small 7B transformer model like Mistral.
From my admittedly quick tests, it clearly knows less than Mistral, hallucinates much more, doesn't follow instructions, and has weaker reasoning. Asking it to translate a Japanese text into English gave me a badly translated summary instead of the full translation.
I don't see how this is "soaring past transformers" when it clearly can't handle any of the useful tasks you'd use a transformer model for today...
As written in the post, it is a base model with only light instruction tuning, i.e. more like Llama2 than Llama2-chat. You should evaluate it as a base model; if you judge it as a chat model, of course it will perform horribly.
3 things stand out to me:
- it's absolutely not usable for the kinds of use cases I solve with GPT-4 (code generation, information retrieval)
- it could technically swallow a 50-page PDF, but it wasn't able to answer questions about it (inference speed was good, but the content was garbage)
- it is OK for casual chatting and translations ("how is your day?")