fosstodon.org is one of the many independent Mastodon servers you can use to participate in the fediverse.
Fosstodon is an invite-only Mastodon instance open to anyone interested in technology, particularly free and open source software. If you wish to join, contact us for an invite.


#swebench

Sara Zan<p>📢 Don't overlook this in the wave of releases! <a href="https://mastodon.social/tags/MistralAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>MistralAI</span></a> has a new coding LLM: it's <a href="https://mastodon.social/tags/Devstral" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Devstral</span></a>, an open model perfect for on-prem, private and local deployments 🐈</p><p>📰 Have a look at the announcement: <a href="https://mistral.ai/news/devstral" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">mistral.ai/news/devstral</span><span class="invisible"></span></a></p><p><a href="https://mastodon.social/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AI</span></a> <a href="https://mastodon.social/tags/GenAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>GenAI</span></a> <a href="https://mastodon.social/tags/LLMs" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>LLMs</span></a> <a href="https://mastodon.social/tags/Devstral" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Devstral</span></a> <a href="https://mastodon.social/tags/SWEBench" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SWEBench</span></a></p>
Sara Zan<p>🧠 Another flagship model released! <a href="https://mastodon.social/tags/Anthropic" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Anthropic</span></a> just unveiled Claude Opus 4 and Claude Sonnet 4, and they are at the top of the leaderboard for coding 💻</p><p>📰 Check out the announcement: <a href="https://www.anthropic.com/news/claude-4" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://www.</span><span class="">anthropic.com/news/claude-4</span><span class="invisible"></span></a></p><p><a href="https://mastodon.social/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AI</span></a> <a href="https://mastodon.social/tags/GenAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>GenAI</span></a> <a href="https://mastodon.social/tags/LLMs" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>LLMs</span></a> <a href="https://mastodon.social/tags/Claude" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Claude</span></a> <a href="https://mastodon.social/tags/Claude4" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Claude4</span></a> <a href="https://mastodon.social/tags/SweBench" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SweBench</span></a></p>
N-gated Hacker News<p>🎉🥳 OMG, Refact.ai scored a groundbreaking 69.8% on <a href="https://mastodon.social/tags/SWEbench" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SWEbench</span></a> and now it's charging you in coins! 💰🔧 Apparently, solving 349 out of 500 tasks makes it the reigning champion of open-source AI agents. Who knew moving from request limits to coin tossing was the future of tech? 🤪👨‍💻<br><a href="https://refact.ai/blog/2025/open-source-sota-on-swe-bench-verified-refact-ai/" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">refact.ai/blog/2025/open-sourc</span><span class="invisible">e-sota-on-swe-bench-verified-refact-ai/</span></a> <a href="https://mastodon.social/tags/RefactAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>RefactAI</span></a> <a href="https://mastodon.social/tags/openSourceAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>openSourceAI</span></a> <a href="https://mastodon.social/tags/techInnovation" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>techInnovation</span></a> <a href="https://mastodon.social/tags/coinTossing" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>coinTossing</span></a> <a href="https://mastodon.social/tags/HackerNews" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>HackerNews</span></a> <a href="https://mastodon.social/tags/ngated" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ngated</span></a></p>
michabbb<p><a href="https://social.vivaldi.net/tags/Devstral" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Devstral</span></a>: New <a href="https://social.vivaldi.net/tags/opensource" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>opensource</span></a> Model for Coding Agents by <a href="https://social.vivaldi.net/tags/MistralAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>MistralAI</span></a> &amp; <a href="https://social.vivaldi.net/tags/AllHandsAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AllHandsAI</span></a> 🧠</p><p>• 🏆 <a href="https://social.vivaldi.net/tags/Devstral" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Devstral</span></a> achieves 46.8% on <a href="https://social.vivaldi.net/tags/SWEBench" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SWEBench</span></a> Verified, outperforming previous <a href="https://social.vivaldi.net/tags/opensource" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>opensource</span></a> models by over 6 percentage points and surpassing <a href="https://social.vivaldi.net/tags/GPT4" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>GPT4</span></a> mini by 20%</p><p>🧵👇<a href="https://social.vivaldi.net/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AI</span></a> <a href="https://social.vivaldi.net/tags/coding" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>coding</span></a></p>
Habr<p>How we build SWE-bench for other languages</p><p>Modern software development is a melting pot of languages: Java, C#, JS/TS, Go, Kotlin… the list goes on. But when it comes to evaluating AI agents that can help write and fix code, we often run into limitations. The popular SWE-bench benchmark, for example, supported only Python for a long time. To bridge the gap between the reality of development and the ability to evaluate AI, our team at</p><p><a href="https://habr.com/ru/companies/doubletapp/articles/901032/" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">habr.com/ru/companies/doubleta</span><span class="invisible">pp/articles/901032/</span></a></p><p><a href="https://zhub.link/tags/swebench" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>swebench</span></a> <a href="https://zhub.link/tags/%D0%B8%D0%B8" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ии</span></a> <a href="https://zhub.link/tags/%D0%BD%D0%B5%D0%B9%D1%80%D0%BE%D1%81%D0%B5%D1%82%D0%B8" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>нейросети</span></a> <a href="https://zhub.link/tags/ml" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ml</span></a> <a href="https://zhub.link/tags/%D0%BC%D0%B0%D1%88%D0%B8%D0%BD%D0%BD%D0%BE%D0%B5_%D0%BE%D0%B1%D1%83%D1%87%D0%B5%D0%BD%D0%B8%D0%B5" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>машинное_обучение</span></a> <a href="https://zhub.link/tags/%D0%B8%D1%81%D0%BA%D1%83%D1%81%D1%81%D1%82%D0%B2%D0%B5%D0%BD%D0%BD%D1%8B%D0%B9_%D0%B8%D0%BD%D1%82%D0%B5%D0%BB%D0%BB%D0%B5%D0%BA%D1%82" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>искусственный_интеллект</span></a> <a href="https://zhub.link/tags/github" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>github</span></a> <a
href="https://zhub.link/tags/open_source" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>open_source</span></a></p>
Habr<p>[Translation] Comparing LLM benchmarks for software development</p><p>In this article we compare various benchmarks that help rank large language models on software development tasks.</p><p><a href="https://habr.com/ru/articles/857754/" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">habr.com/ru/articles/857754/</span><span class="invisible"></span></a></p><p><a href="https://zhub.link/tags/LLM" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>LLM</span></a> <a href="https://zhub.link/tags/%D0%B1%D0%B5%D0%BD%D1%87%D0%BC%D0%B0%D1%80%D0%BA%D0%B8" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>бенчмарки</span></a> <a href="https://zhub.link/tags/%D0%B1%D0%B5%D0%BD%D1%87%D0%BC%D0%B0%D1%80%D0%BA%D0%B8%D0%BD%D0%B3" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>бенчмаркинг</span></a> <a href="https://zhub.link/tags/HumanEval" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>HumanEval</span></a> <a href="https://zhub.link/tags/DevQualityEval" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>DevQualityEval</span></a> <a href="https://zhub.link/tags/CodeXGLUE" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>CodeXGLUE</span></a> <a href="https://zhub.link/tags/Aider" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Aider</span></a> <a href="https://zhub.link/tags/SWEbench" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SWEbench</span></a> <a href="https://zhub.link/tags/ClassEval" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ClassEval</span></a> <a href="https://zhub.link/tags/BigCodeBench" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>BigCodeBench</span></a></p>
michabbb<p>🚀 <a href="https://social.vivaldi.net/tags/Claude35Sonnet" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Claude35Sonnet</span></a> is now rolling out on <a href="https://social.vivaldi.net/tags/GitHubCopilot" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>GitHubCopilot</span></a>, bringing advanced coding capabilities directly to <a href="https://social.vivaldi.net/tags/VisualStudioCode" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>VisualStudioCode</span></a> and <a href="https://GitHub.com" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">GitHub.com</span><span class="invisible"></span></a></p><p>• 🏆 Performance highlights:<br>- Highest score among public models on <a href="https://social.vivaldi.net/tags/SWEbench" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SWEbench</span></a> Verified<br>- 93.7% accuracy on <a href="https://social.vivaldi.net/tags/HumanEval" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>HumanEval</span></a> for <a href="https://social.vivaldi.net/tags/Python" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Python</span></a> function writing</p><p>• 💻 Key features:<br>- Production-ready code generation<br>- Inline debugging assistance<br>- Automated test suite creation<br>- Contextual code explanations</p><p>• ⚙️ Technical details:<br>- Runs via <a href="https://social.vivaldi.net/tags/AmazonBedrock" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AmazonBedrock</span></a><br>- Cross-region inference for enhanced reliability<br>- Available to all <a href="https://social.vivaldi.net/tags/GitHub" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>GitHub</span></a> Copilot Chat users and organizations</p><p>Source: <a href="https://www.anthropic.com/news/github-copilot" rel="nofollow 
noopener" translate="no" target="_blank"><span class="invisible">https://www.</span><span class="ellipsis">anthropic.com/news/github-copi</span><span class="invisible">lot</span></a></p>
michabbb<p>🚀 <a href="https://social.vivaldi.net/tags/Anthropic" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Anthropic</span></a> announces major updates to their <a href="https://social.vivaldi.net/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AI</span></a> model lineup:</p><p>💻 Upgraded <a href="https://social.vivaldi.net/tags/Claude35Sonnet" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Claude35Sonnet</span></a> shows significant improvements:<br>• Achieves 49% on <a href="https://social.vivaldi.net/tags/SWEbench" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SWEbench</span></a> Verified coding benchmark<br>• Leads in software engineering capabilities<br>• Maintains same price and speed as predecessor<br>• Tested by US and UK <a href="https://social.vivaldi.net/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AI</span></a> Safety Institutes</p><p>🔄 New <a href="https://social.vivaldi.net/tags/Claude35Haiku" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Claude35Haiku</span></a> introduction:<br>• Matches <a href="https://social.vivaldi.net/tags/Claude3Opus" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Claude3Opus</span></a> performance at lower cost<br>• Scores 40.6% on SWEbench Verified<br>• Optimized for user-facing products<br>• Available across multiple cloud platforms</p><p>🖱️ Pioneering <a href="https://social.vivaldi.net/tags/ComputerUse" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ComputerUse</span></a> beta feature:<br>• Allows AI to navigate interfaces like humans<br>• Scores 22% on <a href="https://social.vivaldi.net/tags/OSWorld" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>OSWorld</span></a> benchmark<br>• Currently in experimental phase<br>• Supported by new safety classifiers</p><p>⚡ Enterprise 
adoption:<br>• <a href="https://social.vivaldi.net/tags/GitLab" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>GitLab</span></a> reports 10% improvement in DevSecOps tasks<br>• <a href="https://social.vivaldi.net/tags/Replit" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Replit</span></a> leverages computer use for app evaluation<br>• <a href="https://social.vivaldi.net/tags/Cognition" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Cognition</span></a> notes enhanced problem-solving capabilities</p><p><a href="https://www.anthropic.com/news/3-5-models-and-computer-use" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://www.</span><span class="ellipsis">anthropic.com/news/3-5-models-</span><span class="invisible">and-computer-use</span></a></p>
marmelab<p>How do AI software engineering agents work?🤔🤖</p><p>Find the answer, along with valuable insights from the creators of SWE-bench &amp; SWE-agent, in this article⬇️</p><p><a href="https://newsletter.pragmaticengineer.com/p/ai-coding-agents?r=3sbses&amp;utm_campaign=post&amp;utm_medium=web" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">newsletter.pragmaticengineer.c</span><span class="invisible">om/p/ai-coding-agents?r=3sbses&amp;utm_campaign=post&amp;utm_medium=web</span></a> </p><p>Great read! 👏 <span class="h-card" translate="no"><a href="https://mastodon.online/@gergelyorosz" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>gergelyorosz</span></a></span>, <span class="h-card" translate="no"><a href="https://mastodon.social/@elin" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>elin</span></a></span> Nilsson<br> <br><a href="https://mastodon.social/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AI</span></a> <a href="https://mastodon.social/tags/SWEbench" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SWEbench</span></a> <a href="https://mastodon.social/tags/SWEagent" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>SWEagent</span></a> <a href="https://mastodon.social/tags/softwareengineering" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>softwareengineering</span></a></p>