Establishing Best Practices for Building Rigorous Agentic Benchmarks (arxiv.org)
1 point by frontfor 16 hours ago