It’s Friday. WorldCon is going on in Glasgow and I just finished working an extremely daunting schedule. About 78 hours last week, around 50 hours this week, with just a little bit more to do this weekend. Ouch.
Exhausted and punchy, I keep seeing posts from friends at WorldCon and the FOMO is not only real, it is bitter. So to distract myself, I’m cleaning my room and working on the keyboard for the Writing Excuses Retreat. I’m also thinking about A.I.
Before diving into this topic, let me get some things out of the way:
- I’m not a tech-bro
- I don’t have a ChatGPT subscription. I played around with it early on to see what the buzz was about. I don’t use LLMs (like ChatGPT), and I don’t advise people to use them
- I have no skin in the game
- I know the difference between different types of A.I., and I think there should be more nuanced discussions about them. While LLMs are A.I., not all A.I. are LLMs
- I think artists should be paid for their work and stealing content to train A.I. is wrong
With that out of the way, let me take one more step towards clarification and say that this post is about LLMs. Most of the chatter online about LLMs, especially ChatGPT and the other OpenAI products, simply refers to those products as A.I. That’s why I’m titling this post the way I am.
The enshittification of Google search (and other online services) has to do with plugging LLMs into the mix. LLMs don’t do any actual thinking or analysis of a topic the way humans do. When given a query, an LLM starts generating text related to that topic, and it keeps going by predicting, one word at a time, what the next word in its response should be.
If the source material is high quality and accurate, the LLM will generate responses that seem well reasoned and sound. If it has been trained on shit-posts from Reddit, it will respond with fantastical, insane answers that no one should take seriously. That’s how we wind up with Google telling people to put glue on their pizza.
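To make that “next word” idea concrete, here’s a toy sketch in Python. It’s nothing like a real transformer under the hood, and the little word table is made up, but the loop is the same shape: look at the last word, pick a plausible next word, repeat.

```python
import random

# Made-up "training data": for each word, the words seen to follow it.
# A real LLM distills billions of statistics like this from its corpus.
NEXT_WORDS = {
    "put":    ["glue", "cheese"],
    "glue":   ["on"],
    "cheese": ["on"],
    "on":     ["your"],
    "your":   ["pizza"],
}

def generate(prompt: str, max_tokens: int = 5) -> str:
    """No thinking, no analysis: just repeatedly guess a likely next word."""
    words = [prompt]
    for _ in range(max_tokens):
        candidates = NEXT_WORDS.get(words[-1])
        if not candidates:
            break
        words.append(random.choice(candidates))
    return " ".join(words)

print(generate("put"))  # "put glue on your pizza" or "put cheese on your pizza"
```

Swap that table for one built from bad Reddit posts and the loop happily produces bad Reddit answers. Garbage in, garbage out.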
None of this is news. LLMs, and A.I. advancement in general, are suffering from bad press and righteous anger from users who really, really do not want this garbage in their browsers, desktops, or phones.
Also, in this post, I’m not considering the environmental consequences of maintaining servers for A.I. What I mostly want to talk about is training LLMs without the developers having to steal from artists, creators, writers, etc.
The Idea
We create an organization to manage an ethically sourced LLM. It might be a company; it might be a government resource. I’m not sure yet. The main idea is that people can submit their own work to this organization to be part of the training material.
“How do we incentivize people to submit their work to the machine?”
We pay them. It’s not going to be a lot. The person submitting the work retains rights to what they created. It’s like when publishers buy the rights to print and distribute an author’s story. The LLM entity will buy the rights to train using the work, so the author/artist gets some compensation.
“So anyone can just submit anything and get paid for it? That sounds like it would be easy to exploit.”
Before the person submitting the work gets paid, there is a vetting process. Again, we’ll pay people to do this work, which involves three main parts:
- Make sure the work has not been submitted before
- Make sure the work belongs to the person submitting it, or that the submitter has the authority and legal rights to submit the work
- Make sure the material is not poisonous
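Here’s a rough sketch of what that gate might look like in code. Everything in it is hypothetical: the hash only catches exact duplicates (a real system would also need near-duplicate detection, something like MinHash, to catch light edits), and the two boolean flags stand in for the actual human labor, which is the expensive part.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Content hash for the exact-duplicate check."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def vet_submission(text: str, rights_verified: bool,
                   reviewer_approved: bool, corpus: set) -> bool:
    """Gate a submission on the three checks above. The two flags stand in
    for paid human work: a rights check and a Wikipedia-style quality review."""
    h = fingerprint(text)
    if h in corpus:            # 1. has this exact work been submitted before?
        return False
    if not rights_verified:    # 2. does the submitter own or control the rights?
        return False
    if not reviewer_approved:  # 3. did a paid reviewer vouch for the material?
        return False
    corpus.add(h)              # accept it, record it, queue the payment
    return True
```

Only after something like vet_submission says yes does the submitter get paid.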
We have examples of companies/publishers handling the first two tasks already. And, strangely enough, we have an example of the third being handled as well: Wikipedia.
If we pay for the kind of work that people already volunteer to do on Wikipedia, we might be able to make sure only high-quality content is fed into the machine.
“This sounds like it is doomed to lose money.”
That may be true. The value of this LLM will certainly be low at the beginning, and the entire program will need to be subsidized before it can produce anything of competitive quality. If the endeavor manages to last long enough, however, we can start licensing the use of the LLM to companies and individuals.
Google, Microsoft, and Apple would probably pay significant sums of cash to take advantage of an ethically sourced LLM. One that isn’t telling people that pythons are mammals.
“What problem does this solve?”
My suggestion addresses at least part of the problem of stealing content to train LLMs. It also offers a way to compensate creators for their work, even if the compensation is very small. If we’re going to live in a world with LLMs, I think this is a way to do it without theft.
However, what do we actually need LLMs for? What problem do LLMs solve?
I can’t think of a compelling need that LLMs satisfy. They’re an extremely geeky way of generating text. I think they’re interesting, but I’m not sure they’re useful.
If I were forced to guess, I would say that successfully navigating searches with a handful of keywords has always been something of a challenge for non-technical people. There have been attempts to make search use more natural language since ask.com (Ask Jeeves) launched in 1996.
An LLM tied to a search engine, therefore, allows natural language in and natural language out, without requiring the user to drill through a potentially long list of search results to find the answer they’re looking for. Maybe there is a little bit of value there. It’s dubious, but I think I can see it.
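The shape of that pipeline is simple enough to sketch. Everything here is a stand-in: the two-entry “index” is fake, and the answer step just echoes the top snippet where a real product would hand the results to an LLM to summarize.

```python
# Toy stand-in for a search index: keyword -> snippet.
SEARCH_INDEX = {
    "python": "Pythons are nonvenomous snakes, not mammals.",
    "pizza": "Pizza is a dish of Italian origin.",
}

def keyword_search(question: str) -> list:
    """The ordinary keyword search still happens under the hood."""
    q = question.lower()
    return [snippet for term, snippet in SEARCH_INDEX.items() if term in q]

def answer(question: str) -> str:
    """Natural language in, natural language out."""
    hits = keyword_search(question)
    return hits[0] if hits else "I couldn't find anything on that."

print(answer("Is a python a mammal?"))  # Pythons are nonvenomous snakes, not mammals.
```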
I’m not sure it’s worth the time, investment, or cost to the environment, though.
That’s my idea. Let me know what you think.