#aitesting


David Berenstein has joined the Giskard team as DevRel ⭐🐢

David brings valuable experience from his previous roles at Argilla and Hugging Face, where he helped developers discover the joys of working with (synthetic) data. He loves cooking things up with data but also commits a lot of his time to cooking in real life 👨‍🍳 His expertise will be key as we build our LLM Evaluation Hub.

Welcome to the team, David! 🚀

AST NEWSLETTER FOR FEBRUARY IS OUT!
The conversation around AI-driven testing tools is reaching fever pitch, but does the technology live up to its billing? In this month’s news spotlight, we explore the real-world impact, technical hurdles, and a call to action for testers to share concrete results.

Also get your latest updates on CAST 2025!

open.substack.com/pub/associat

open.substack.com: AST Monthly Newsletter, February 2025

◆ Hallucination and factual accuracy
◆ Bias and fairness
◆ Resistance to adversarial attacks
◆ Harmful content prevention

The LLM Benchmark incorporates diverse linguistic and cultural contexts to ensure comprehensiveness, and representative samples will be open-source.

Read about our methodology and early findings: gisk.ar/3CRFdeB

We will be sharing more results in the coming months 👀

gisk.ar: Giskard announces a new LLM Evaluation Benchmark during the Paris AI Summit. Giskard partners with Google DeepMind to launch an independent multilingual LLM benchmark, evaluating hallucinations and AI security risks.

Can we trust DeepSeek R1? A Giskard evaluation 🐳🐢

With all the hype around DeepSeek R1, our LLM safety research team decided to conduct an evaluation to check if R1 is as good as it claims. While it impresses in some areas, we found critical limitations that raise concerns for real-world applications. Here are some unexpected examples 👇

🐢 Seek the turtle in Cannes! ☀️

Join us at the World AI Cannes Festival (WAICF) from February 13-15!

Stop by our booth and meet our team to discuss quality, security, and compliance for GenAI applications.
More details about our participation coming soon... 👀

Are you attending WAICF? Drop a comment below or DM us to schedule a meeting.

Me: How many letters 'r' are there in the word 'strawberry'?

ChatGPT answers: There are three occurrences of the letter ‘r’ in the word ‘strawberry’.

ChatGPT, in another chat, answers: The word strawberry contains three letters ‘r’.

(This is better than the '2' I got two months ago, when I first asked.)

Google's Gemini: There are 3 letters "r" in the word "strawberry."

(This was my first time using it for questions of this type.)
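
For anyone who wants a ground truth to compare the chatbot answers against, here is a minimal Python sketch. The helper name `count_letter` is just an illustration and not part of any tool mentioned in the posts; it counts the letter case-insensitively.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


# Deterministic check of the question asked above.
print(count_letter("strawberry", "r"))  # prints 3
```

Unlike an LLM, this gives the same answer on every run, which makes it a handy reference when testing how a model handles questions of this type.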

⚡️ Building and evaluating a Banking Supervision agent 🔍

We've published a new tutorial that shows how to:
• Build a RAG agent with LlamaIndex to answer questions about ECB banking supervision
• Scan for LLM vulnerabilities like hallucinations and prompt injection
• Evaluate RAG components (retriever, generator, rewriter) with different question types

Check out the complete tutorial in our docs: gisk.ar/3OQ1tYz
More details about the results👇