🧪 New Notebook Drop: Evaluating LLMs for harmful outputs! Which models are actually safe for prod? We built an LLM-as-a-Judge pipeline using the Together Evals API to compare models on harmfulness. Code👇