
Leaked Google document: “We Have No Moat, And Neither Does OpenAI”

The most interesting thing I've read recently about LLMs - a purportedly leaked document from a researcher at Google talking about the huge strategic impact open source models are having
simonwillison.net/2023/May/4/n

Jim Gardner

@simon Hey Simon, I’ve been holding off on using ChatGPT, Bard, etc., even though I think they could be useful. This is because I can see (especially with ChatGPT) the horrible unethical behaviour that the companies are using in their arms race to deploy deploy deploy. With all the talk in this leaked doc about open source alternatives, do you know of any LLMs that are “ethically sourced” and available for the average punter to use? I don’t want to be left behind :/

@jimgar the ethics of this stuff is incredibly complicated

I'm very optimistic about the models being trained on the RedPajama data - there's one out already and evidently more to follow very shortly simonwillison.net/tags/redpaja


Claude is one of the most promising closed alternatives to ChatGPT - Anthropic take an interesting approach to AI safety which they call "constitutional AI" anthropic.com/index/introducin


@simon thank you so much, l’ll give these a look. Everywhere I look in tech it’s one ethical nightmare after another 😵‍💫

@simon what's your take on the copyrighted material included in RedPajama through CommonCrawl? It seems to me that one could train a model on only text that has been shared freely and that might be more ethical. cc @jimgar

@resing @jimgar I'm not convinced it's possible to train a usable LLM without including copyrighted material in the raw pretraining data

As such, I personally think it's a necessary evil to avoid LLM technology becoming a monopoly of the organizations that are willing to train against crawler data

@simon @jimgar not sure I follow. Are you saying that crawler data, which includes copyrighted material, shouldn’t be used by commercial companies, and that LLMs are inherently flawed because of that? If so, I’m not saying you’re wrong, just trying to understand.

@simon @resing it *all* feels fundamentally wrong, so long as the results rely on indiscriminate harvesting of people’s work without permission. Literally the only compelling argument I have heard is the “necessary evil” Simon mentions - doing it anyway but making it open source. I just find it sad that this is the position we’re in at all, and worse, how little the majority of people seem to care about provenance and permissions full stop.

@jimgar @resing search engines work by indiscriminately harvesting people's work without their permission, and have done for decades

What's different here isn't how the things are built, it's what they can be used for

People mostly tolerated search engines because they saw them as useful - they helped people's work be found, they didn't (appear to) threaten their livelihoods

@jimgar @resing note that I'm not saying that search engines were morally/ethically pure here either!

The ethics around this are deeply complicated - there are no easy or obvious answers

@simon @jimgar the legal issue might be resolved soon. If @binarybits@instinctd.com is right, Stable Diffusion could lose the copyright lawsuit filed against them. I buy his argument in favor of that. If that's the case, LLMs trained only on sets that allow that use might really take off arstechnica.com/tech-policy/20
