I’m restarting my LLM evaluations, with a focus on financial research tasks. Initial ideas:
1 • news sentiment analysis
2 • financial calculations
3 • 10-K analysis, etc.
All code will be shared, as the goal is to learn. I’ll test both small and large models. I’ll also fine-tune small open-source models and see how they compare to large ones on specific tasks! Experiment ideas are welcome.
Image above shows frontier LLMs. It is impressive how much performance we get from Kimi K2 and DeepSeek R1, given their price. Total Cost is the sum of input and output token costs. Performance is the LLM arena Elo score. I'll define my own Elo scores as I experiment.
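A rough sketch of how these two numbers could be computed. Everything here is an assumption for illustration: the function names, the per-million-token pricing convention, and the use of the standard Elo update with K = 32 for pairwise model comparisons. The actual scoring scheme for these experiments is not defined yet.

```python
def total_cost(input_tokens: int, output_tokens: int,
               input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of a workload: input token cost + output token cost.

    Assumes prices are quoted per one million tokens (hypothetical convention).
    """
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one head-to-head comparison.

    score_a is 1.0 if model A's answer wins, 0.5 for a tie, 0.0 if it loses.
    K = 32 is an assumed value, not one taken from the post.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Placeholder prices, not real quotes for any model.
    print(total_cost(1_000_000, 1_000_000, 1.0, 2.0))
    # Two models start level; model A wins one comparison.
    print(update_elo(1000.0, 1000.0, 1.0))
```

In this setup each task (for example, one sentiment-labelling prompt) would be a "match" between two models, and the ratings accumulate over many tasks.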