Microsoft has two announcements for subscribers to its Microsoft 365 Personal and Family plans today. First, you're getting the Copilot-powered AI features that Microsoft has been rolling out to businesses and Copilot Pro subscribers, like summarizing or generating text in Word, drafting slideshows in PowerPoint based on a handful of criteria, or analyzing data in Excel. Second, you'll be paying more for the privilege of using those features, to the tune of an extra $3 a month or $30 a year.
This raises the price of a Microsoft 365 Personal subscription from $7 a month or $70 a year to $10 and $100; a family subscription goes from $10 a month or $100 a year to $13 a month or $130 a year. For current subscribers, these prices go into effect the next time your plan renews.
Current subscribers are also being given an escape hatch "for a limited time." "Classic" Personal and Family plans at the old prices with no Copilot features included will still be offered, but you'll need to go to the "services & subscriptions" page of your Microsoft account and attempt to cancel your existing subscription to be offered the discounted pricing.
Microsoft hasn't said how long this "limited time" offer will last, but presumably it will only last for a year or two to help ease the transition between the old pricing and the new pricing. New subscribers won't be offered the option to pay for the Classic plans.
Subscribers on the Personal and Family plans can't use Copilot indiscriminately; they get 60 AI credits per month to use across all the Office apps, credits that can also be used to generate images or text in Windows apps like Designer, Paint, and Notepad. It's not clear how these will stack with the 15 credits that Microsoft offers for free for apps like Designer, or the 50 credits per month Microsoft is handing out for Image Cocreator in Paint.
As Microsoft notes, this is the first price increase it has ever implemented for the personal Microsoft 365 subscriptions in the US, which have stayed at the same levels since being introduced as Office 365 over a decade ago. Pricing for the business plans and pricing in other countries have increased before. Pricing for Office Home 2024 ($150) and Office Home & Business 2024 ($250), which can't access Copilot or other Microsoft 365 features, also remains unchanged.
On the Github page for the quixotic project, coder ading2210 discusses how Adobe Acrobat included some robust support for JavaScript in the PDF file format. That JS coding support—which dates back decades and is still fully documented in Adobe's official PDF specs—is currently implemented in a more limited, more secure form as part of PDFium, the built-in PDF-rendering engine of Chromium-based browsers.
In the past, hackers have used this little-known Adobe feature to code simple games like Breakout and Tetris into PDF documents. But ading2210 went further, recompiling a streamlined fork of Doom's open source code using an old version of Emscripten that outputs optimized asm.js code.
With that code loaded, the Doom PDF can take inputs via the user typing in a designated text field and generate "video" output in the form of converted ASCII text fed into 200 individual text fields, each representing a horizontal line of the Doom display. The text in those fields is enough to simulate a six-color monochrome display at a "pretty poor but playable" 13 frames per second (about 80 ms per frame).
Zooming in shows the individual ASCII characters that make up a PDF Doom frame. (Credit: ading2210)
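For a sense of how that text-field "display" works, here's a minimal Python sketch. It is not the project's actual code (which is JavaScript running inside the PDF), and the character palette is our own assumption; it just shows the general idea of quantizing a 320×200 grayscale framebuffer into one ASCII string per scanline:

```python
# Toy illustration only; the real port does this in PDF-embedded JavaScript.
WIDTH, HEIGHT = 320, 200       # Doom's classic resolution: 200 scanlines
PALETTE = " .:-=#@"            # assumed dark-to-bright ramp, not the port's own

def frame_to_ascii_rows(framebuffer):
    """framebuffer: HEIGHT x WIDTH grid of 0-255 grayscale values."""
    rows = []
    for y in range(HEIGHT):
        row = "".join(PALETTE[framebuffer[y][x] * len(PALETTE) // 256]
                      for x in range(WIDTH))
        rows.append(row)       # each row goes into its own PDF text field
    return rows

# Demo with a fake horizontal-gradient frame:
fake_frame = [[(x * 255) // WIDTH for x in range(WIDTH)] for _ in range(HEIGHT)]
print(frame_to_ascii_rows(fake_frame)[0][:64])
```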
Beyond its obvious limitations in sound and color, PDF Doom suffers from text-field input that makes it nearly impossible to perform two actions simultaneously (i.e., moving and shooting). We also have to dock at least a few coolness points because the port doesn't actually work on generic desktop versions of Adobe Acrobat—you need to load it through a Chromium-based web browser. But the project gains those coolness points back with a web front-end that lets users load generic WAD files into a playable PDF.
Critical quibbles aside, it's a bit wild playing a game of Doom in a format more commonly used for viewing tax documents and forms from your doctor's office. We eagerly look forward to the day that some enterprising hacker figures out a way to get a similar, playable Doom working on the actual printed PDF page that comes out of our printers.
In 2023, AI researchers at Meta interviewed 34 native Spanish and Mandarin speakers who lived in the US but didn’t speak English. The goal was to find out what people who constantly rely on translation in their day-to-day activities expect from an AI translation tool. What those participants wanted was basically a Star Trek universal translator or the Babel Fish from the Hitchhiker’s Guide to the Galaxy: an AI that could not only translate speech to speech in real time across multiple languages, but also preserve their voice, tone, mannerisms, and emotions. So, Meta assembled a team of over 50 people and got busy building it.
What this team came up with was a next-gen translation system called Seamless. The first building block of this system is described in Wednesday’s issue of Nature; it can translate speech among 36 different languages.
Language data problems
AI translation systems today are mostly focused on text, because huge amounts of text are available in a wide range of languages thanks to digitization and the Internet. Institutions like the United Nations or European Parliament routinely translate all their proceedings into the languages of all their member states, which means there are enormous databases comprising aligned documents prepared by professional human translators. You just needed to feed those huge, aligned text corpora into neural nets (or hidden Markov models before neural nets became all the rage) and you ended up with a reasonably good machine translation system. But there were two problems with that.
The first issue was that those databases comprised formal documents, which made the AI translators default to the same boring legalese in the target language even if you tried to translate comedy. The second problem was speech—none of this data included audio.
The problem of language formality was mostly solved by including less formal sources like books, Wikipedia, and similar material in AI training databases. The scarcity of aligned audio data, however, remained. Both issues were at least theoretically manageable in high-resource languages like English or Spanish, but they got dramatically worse in low-resource languages like Icelandic or Zulu.
As a result, the AI translators we have today support an impressive number of languages in text, but things are complicated when it comes to translating speech. There are cascading systems that simply do this trick in stages. An utterance is first converted to text just as it would be in any dictation service. Then comes text-to-text translation, and finally the resulting text in the target language is synthesized into speech. Because errors accumulate at each of those stages, the performance you get this way is usually poor, and it doesn’t work in real time.
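To make those stages concrete, here's a minimal sketch of such a cascade. The three stage functions are hypothetical placeholders, not any real library's API; the point is that each stage consumes the previous stage's possibly erroneous output:

```python
# Hypothetical cascading speech-to-speech translation pipeline.
def transcribe(audio: bytes, lang: str) -> str:            # stage 1: ASR
    ...

def translate_text(text: str, src: str, dst: str) -> str:  # stage 2: MT
    ...

def synthesize(text: str, lang: str) -> bytes:             # stage 3: TTS
    ...

def cascade_translate(audio: bytes, src: str, dst: str) -> bytes:
    text = transcribe(audio, src)                # ASR errors enter here...
    translated = translate_text(text, src, dst)  # ...get compounded here...
    return synthesize(translated, dst)           # ...and are baked in here.
```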
A few systems that can translate speech to speech directly do exist, but in most cases they only translate into English and not the other way around. Your foreign-language interlocutor can say something in one of the languages supported by tools like Google's AudioPaLM, and the tool will translate it into English speech, but you can't have a conversation going both ways.
So, to pull off the Star Trek universal translator thing Meta's interviewees dreamt about, the Seamless team started by sorting out the data scarcity problem. And they did it in quite a creative way.
Building a universal language
Warren Weaver, a mathematician and pioneer of machine translation, argued in 1949 that there might be an as-yet-undiscovered universal language working as a common base of human communication. This common base of all our communication was exactly what the Seamless team went looking for in its search for data more than 70 years later. Weaver's universal language turned out to be math—more precisely, multidimensional vectors.
Machines do not understand words as humans do. To make sense of them, they need to first turn them into sequences of numbers that represent their meaning. Those sequences of numbers are numerical vectors that are termed word embeddings. When you vectorize tens of millions of documents this way, you’ll end up with a huge multidimensional space where words with similar meaning that often go together, like “tea” and “coffee,” are placed close to each other. When you vectorize aligned text in two languages like those European Parliament proceedings, you end up with two separate vector spaces, and then you can run a neural net to learn how those two spaces map onto each other.
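As a toy illustration of that geometry, here's a short Python sketch with made-up four-dimensional vectors (real embeddings are learned from data and have hundreds of dimensions). Cosine similarity is one standard way to measure how close two embeddings sit:

```python
import numpy as np

# Hand-written toy embeddings; real ones are learned, not chosen.
emb = {
    "tea":        np.array([0.8, 0.1, 0.6, 0.0]),
    "coffee":     np.array([0.7, 0.2, 0.7, 0.1]),
    "carburetor": np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["tea"], emb["coffee"]))      # ~0.98: related words, close
print(cosine(emb["tea"], emb["carburetor"]))  # ~0.12: unrelated words, far
```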
But the Meta team didn't have those nicely aligned texts for all the languages they wanted to cover. So, they vectorized all texts in all languages as if they were a single language and dumped them into one embedding space called SONAR (Sentence-level Multimodal and Language-Agnostic Representations). Once the text part was done, they moved on to speech data, which they vectorized using a popular W2v (word to vector) tool and added to the same massive multilingual, multimodal space. Of course, each embedding carried metadata identifying its source language and whether it was text or speech before vectorization.
The team just used huge amounts of raw data—no fancy human labeling, no human-aligned translations. And then, the data mining magic happened.
SONAR embeddings represented entire sentences instead of single words. Part of the reason for that was to control for differences between morphologically rich and morphologically simple languages, where a single word in the former may correspond to multiple words in the latter. But the most important thing was that it ensured that sentences with similar meaning in multiple languages ended up close to each other in the vector space.
It was the same story with speech, too—a spoken sentence in one language was close to spoken sentences in other languages with similar meaning. It even worked between text and speech. So, the team simply assumed that embeddings in two different languages or two different modalities (speech or text) that are at a sufficiently close distance to each other are equivalent to the manually aligned texts of translated documents.
This produced huge amounts of automatically aligned data. The Seamless team suddenly got access to millions of aligned texts, even in low-resource languages, along with thousands of hours of transcribed audio. And they used all this data to train their next-gen translator.
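Here's a rough sketch of that mining step, with random matrices standing in for real SONAR sentence embeddings (random vectors won't actually land close together, and the real system uses a margin-based score rather than a fixed cosine cutoff, but the mechanics are the same):

```python
import numpy as np

rng = np.random.default_rng(0)
english = rng.normal(size=(1000, 256))  # stand-ins for English embeddings
zulu = rng.normal(size=(800, 256))      # stand-ins for Zulu embeddings

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

en, zu = normalize(english), normalize(zulu)
similarity = en @ zu.T                  # cosine similarity for all pairs

THRESHOLD = 0.9                         # assumed cutoff for "same meaning"
best = similarity.argmax(axis=1)        # nearest Zulu sentence per English one
mined = [(i, int(best[i])) for i in range(len(en))
         if similarity[i, best[i]] >= THRESHOLD]
print(f"mined {len(mined)} candidate sentence pairs")
```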
Seamless translation
The automatically generated data set was augmented with human-curated texts and speech samples where possible and used to train multiple AI translation models. The largest one was called SEAMLESSM4T v2. It could translate speech to speech from 101 source languages into any of 36 output languages, and translate text to text. It could also work as an automatic speech recognition system in 96 languages, translate speech to text from 101 into 96 languages, and translate text to speech from 96 into 36 languages—all from a single unified model. It also outperformed state-of-the-art cascading systems by 8 percent in speech-to-text and by 23 percent in speech-to-speech translation, as measured by BLEU (Bilingual Evaluation Understudy), a score commonly used to evaluate the quality of machine translation.
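For a sense of what BLEU actually measures, here's a hedged toy example using NLTK's implementation. It scores n-gram overlap between a candidate translation and reference translations on a 0-to-1 scale (the percentages above are relative improvements in that score, not absolute points):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]  # human translation(s)
candidate = ["the", "cat", "is", "on", "the", "mat"]     # machine output

# Smoothing keeps short sentences with missing n-grams from scoring zero.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")  # 1.0 would be a perfect n-gram match
```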
But it can now do even more than that. The Nature paper published by Meta's Seamless team ends with the SEAMLESSM4T models, but Nature has a long editorial process to ensure scientific accuracy: the paper published on January 15, 2025, was submitted in late November 2023. A quick search of arXiv.org, a repository of not-yet-peer-reviewed papers, turns up the details of two other models that the Seamless team has already built on top of SEAMLESSM4T: SeamlessStreaming and SeamlessExpressive, which take this AI even closer to making a Star Trek universal translator a reality.
SeamlessStreaming is meant to solve the translation latency problem. The baseline SEAMLESSM4T, despite all the bells and whistles, worked as a standard AI translation tool: you had to say what you wanted to say, push "translate," and it spat out the translation. SeamlessStreaming was designed to take this experience a bit closer to what human simultaneous interpreters do—it translates what you're saying as you speak, in a streaming fashion. SeamlessExpressive, on the other hand, is aimed at preserving the way you express yourself in translations. When you whisper, say something in a cheerful manner, or shout with anger, SeamlessExpressive will encode the features of your voice, like tone, prosody, volume, tempo, and so on, and transfer those into the output speech in the target language.
Sadly, it still can’t do both at the same time; you can only choose to go for either streaming or expressivity, at least at the moment. Also, the expressivity variant is very limited in supported languages—it only works in English, Spanish, French, and German. But at least it’s online so you can go ahead and give it a spin.
If you want to know when I pretty much drew a line through my friendship with Neil Gaiman, it was when Neil acknowledged that he made moves on his early-20s nanny on her first day of employment. This meant that the absolute best case scenario of this whole situation was that he didn’t have the sense or wisdom to understand that making a move on a woman 40 years his junior, economically dependent on him, and whom he had met just literally hours before, was an extremely questionable idea. And by extremely questionable I mean dude what the fuck how do you not understand the actual consent issues involved here. The answer I came to is he probably did understand, and that when all was said and done, the “absolute best case scenario,” which is still very terrible, was not where we would end up. And indeed, that’s not where we are today.
And, while you should in no way consider me anywhere but on the periphery of any of this (please direct your attention and care to those who rather persuasively allege harm at his hand), it still fucking hurts. Neil’s been a friend, he’s someone whose work I’ve admired immensely, and it’s not entirely inaccurate to say that I owe a fair amount of the trajectory of my career to him. In 2006, he declined a spot on the finalist list for the Best Novel Hugo, which meant that the next book in the vote tally moved into the list. That book was Old Man’s War. We didn’t know each other at the time and he didn’t know which book would benefit from his action, but that doesn’t matter. It was a huge boost for me in the SFF community, and I thanked him for it when we finally did meet in 2009. He’s been kind to me and to my family and I’ve been happy to know him, and I think he was happy to know me.
Nothing about him having been my friend or boosting my career excuses or mitigates his actions, both alleged and admitted. This is not a defense of him. He’s done what he’s done and as noted above, the absolute best case scenario is still terrifically bad. The acknowledgement of friendship is context.
Here are two things about me, one which you know and one which you may not. The first is that I’m well-known for having public opinions on the internet, and the second is that when I get stressed and upset about things in my personal life I get real quiet and internal about it. I acknowledge this seems at least superficially contradictory, but I don’t think it is: there’s “public persona” me and there’s “private life” me. They’re both me, tuned differently, and I’ve made the point over the years that both modes exist. Usually having both is not a problem! But when someone you consider a pretty good friend who is also a public individual fucks up badly, well, then it becomes a problem. Or at least, complicated.
When the first set of allegations came out last year, I made a brief post about it and then otherwise kept quiet, because this was my friend and I needed to work out what was going on, and how a person I had as a friend had this other part to his life that was for me new and rotten information, and also there was the rest of my life to deal with, which is not insignificant. This was not enough for some people and maybe still isn’t, and that’s their opinion to have. Likewise, when I decided for myself that I was out, I didn’t make a public declaration of it. No matter how public he or I are, our friendship existed in that other sphere too, and that sphere is where I made that decision. I was out, and when it was done, in my head, it was done. Again, this will not be enough for some people, and again, that’s their opinion to have.
Why bring it up now? One, because I know other people who are being run through the same wringer with this, dealing with the person they knew and this other person they didn’t, but they’re actually the same person and now they have to integrate all of it into their understanding. I want them to know, from the bottom of my heart: fucking same. Some of these folks are friends of his. Some are fans. Some are both! All of us are sitting with it, and while, again, we are all on the periphery of harm here, it’s still something we have to work on. Some will do it publicly, some will do it privately, some will take more time than others to get where they’re going with this. They should be able to do it how they want. Maybe others should offer them some grace about it.
Two, because I’ve done my thinking about it, made my decisions, and have had time to live those decisions and am at a point where talking about it doesn’t make me feel sick or pressured to say something more than I’m prepared to say. Neil’s been a friend, and an important person to me, and someone I’ve been happy to know. But the friendship has been drawn down and done, and at this point, given everything I’ve written above, I don’t think he’ll complain much about that. He’s got a lot of work to do, and I hope he gets to it soon.
(Three, because I see some deeply shitty people hoping I’m “next,” which among other things means they are explicitly hoping that I’ve done things close to what Neil is credibly accused of, to actual other people, just so they can have the satisfaction of seeing me “owned.” And, well. Those people can go fuck themselves.)
This has been a bad crazy week, and it’s just Wednesday, in a year that’s been pretty terrible a mere fifteen days in, and which I don’t think is going to get any better from here. Take care of yourselves out there, folks.
COLORADO SPRINGS — Weeks after residents voted in favor of legalizing recreational marijuana sales in Colorado Springs, elected leaders are considering putting the issue back on the ballot in April, saying people who voted “yes” could have been mistaken.
The city council is expected to vote at its next meeting Jan. 28 on whether to re-refer the issue to the April 1 ballot, when voter turnout is historically lower than in general elections, claiming that “confusing” language had muddied the issue.
The move marked a further show of resistance to recreational marijuana in a city whose officials have long argued that it contributes to crime and increased drug use. Colorado Springs is the largest city in Colorado that has refused to allow the sale of recreational marijuana since it became legal in January 2014.
“It boggles my mind that we want to put it on the ballot again,” Councilwoman Yolanda Avila said Tuesday, adding that she would not support pushing the measure to another vote.
“I find that the citizens of Colorado Springs, the constituents, the voters are pretty smart,” she said. “And I think it’s so unfair that, in November was a presidential election when people get up to vote more than any other time, we are going to have the least voter turnout April 1, because we don’t even have the mayor running.”
In November, voters approved ballot question 300, with 54% marking “yes” to allow the existing medical marijuana businesses in the city to become eligible to apply for recreational licenses. The businesses would be required to comply with a 1,000-foot buffer zone from schools, day care centers and treatment facilities.
Data shows 130,677 people voted in support of the measure in November, an election that brought out a record number of voters.
But the voter-approved initiative directly conflicted with an ordinance council members adopted in September, prior to the general election, that set the buffer to 1 mile for recreational cannabis shops if they were to be approved by voters.
A 1-mile buffer zone would effectively prevent any of the existing medical shops in the city from applying for recreational cannabis licenses.
Even as the council considers putting the issue of recreational sales back on the ballot, it voted Tuesday 6-3 to amend the buffer zone to 1,000 feet, matching the measure voters approved in November.
The city will begin accepting applications for recreational marijuana licenses no later than Feb. 10. The city has 60 days to review the applications, which would be days after the April 1 election. Only business owners who already hold a medical marijuana license are eligible to apply.
Inside standing-room-only city council chambers Tuesday, more than two dozen people argued the city council members were overcomplicating the process and ignoring the will of voters.
Colorado Springs resident Aaron Bluse, an owner of Altitude Organic Medicine, which has three medical marijuana dispensaries in Colorado Springs and an adult-use shop in Dillon, said he would lose faith in the council if a second vote is held in April.
“The reason we were so adamant about our position today is that there’s a clear subversion of the will of the voters and that there’s a high level of dissonance between the council and what the voters really want in this city,” Bluse said.
“We’ll fight it completely and we will continue to enact the will of the voters, which has clearly spoken in the most turnout that the city has ever seen in the 2024 November election.”
Supporters of the measure said recreational sales would bring jobs and revenue to the city. Currently, if residents want to buy recreational marijuana products, they must drive to adjacent Manitou Springs to the west, or to Pueblo, about an hour south.
Among those pushing for a new ballot measure on pot was Councilman Dave Donelson, who suggested during a work session Monday that Question 300 was poorly worded and may have misled voters. He proposed that residents have the chance for a new vote to “know once and for all if these citizens want recreational marijuana in Colorado Springs or not.”
“The previous vote, I think, was confused,” Donelson said. “And I think it really could have had the impact that something passed that the majority of citizens don’t really support.”
“While we must respect the vote, we will, we also have a responsibility for public safety,” Councilman David Leinweber said. “And as I hear about countless stories of youth who have had challenges with psychosis, anxiety, mental health, as a representative of the city, I have to have concerns about that.”
Kent Jarnig, a combat veteran who fought in Vietnam and chair emeritus for the El Paso County Progressive Veterans of Colorado, said THC helps him and other veterans cope with the longstanding effects of war.
“Unless you are in combat, you don’t really understand what it is like, why when they come home, when they’re in peace, all of a sudden they’re drinking, all of a sudden they’re doing drugs,” Jarnig said. “And I’m here to try and hope you will understand my words on what recreational, or what I call THC cannabis products, mean to us.”
He said all of his doctors support his use of THC.
“If each of you won’t support recreational marijuana, by simple logic, you must ban the sale of cigarettes and alcohol in Colorado Springs,” Jarnig said.
“I get the feeling that city council is going to keep putting this on the ballot until it’s falling down,” he said, with a round of applause from the audience.
Another veteran said approving the sale of recreational marijuana would mark a “significant step” toward improving access to resources for people suffering from PTSD, chronic pain and other health conditions.
On Tuesday, the council also approved an ordinance creating an additional sales tax of 5% on recreational marijuana sales in the city. The generated revenue will go into a fund that will support public safety programs, mental health services and post-traumatic stress disorder treatment programs.
City officials estimate about $2 million per year could be funneled into that fund, though the amount is highly dependent on how many medical marijuana business owners apply for licenses to sell recreational marijuana.
In 2017, the city placed a cap on the number of medical marijuana business locations and is no longer accepting applications for new centers.
Financial experts for the city expect recreational marijuana sales to bring in an additional $350,100 in sales tax revenue annually.
The debate in Colorado Springs comes as marijuana sales — and tax dollars — continue to fall statewide. Cannabis tax collections peaked in the 2020-21 budget year, when the state collected $424 million in sales and excise taxes, then fell 41% to $248 million in the 2023-24 budget year.
In November, I participated in a technologist roundtable about privacy and AI,
for an audience of policy folks and regulators. The discussion was great! It
also led me to realize that there are a lot of things that privacy experts know
and agree on about AI… but might not be common knowledge outside our bubble.
That seems the kind of thing I should write a blog post about!
1. AI models memorize their training data
When you train a model with some input data, the model will retain a
high-fidelity copy of some data points. If you "open up" the model and analyze
it in the right way, you can reconstruct some of its input data nearly exactly.
This phenomenon is called memorization.
Memorization happens by default, to all but the most basic AI models. It's often
hard to quantify: you can't say in advance which data points will be memorized,
or how many. Even after the fact, it can be hard to measure precisely.
Memorization is also hard to avoid: most naive attempts at preventing it fail
miserably — more on this later.
Memorization can be lossy, especially with images, which aren't memorized
pixel-to-pixel. But if your training data contains things like phone numbers,
email addresses, recognizable faces… Some of it will inevitably be stored by
your AI model. This has obvious privacy consequences.
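As a deliberately tiny illustration, here's a toy bigram "model" trained on three sentences. The (fake) phone number appears exactly once in the training text, yet greedy generation reproduces it verbatim; real models are vastly more complex, but the same effect has been demonstrated on production-scale LLMs:

```python
from collections import defaultdict

training_text = (
    "the weather is nice today . "
    "call alice at 555-0173 for details . "   # fake number, seen once
    "the weather report says rain tomorrow ."
).split()

# "Train": count which token follows which.
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(training_text, training_text[1:]):
    counts[prev][nxt] += 1

def complete(word, steps=5):
    out = [word]
    for _ in range(steps):
        followers = counts[out[-1]]
        if not followers:
            break
        out.append(max(followers, key=followers.get))  # greedy decoding
    return " ".join(out)

print(complete("alice"))  # -> "alice at 555-0173 for details ."
```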
2. AI models then leak their training data
Once a model has memorized some training data, an adversary can typically
extract it, even without direct access to the internals of the model. So the
privacy risks of memorization are not theoretical: AI models don't just memorize
data, they regurgitate it as well.
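In its crudest form, extraction is just prompting with plausible prefixes and scanning the output for sensitive patterns. Here's a sketch, where query_model is a hypothetical stand-in for whatever completion API a deployed model exposes (here it simulates a model that memorized a fake number):

```python
import re

def query_model(prompt: str) -> str:
    # Hypothetical completion API; simulates a model regurgitating
    # a (fake) memorized phone number from its training data.
    return prompt + " 555-0173, according to my notes."

PHONE = re.compile(r"\b\d{3}-\d{4}\b")
prefixes = ["call alice at", "you can reach alice at", "alice's number is"]

for prefix in prefixes:
    completion = query_model(prefix)
    for hit in PHONE.findall(completion):
        print(f"{prefix!r} leaked what looks like a phone number: {hit!r}")
```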
In general, we don't know how to robustly prevent AI models from doing things
they're not supposed to do. That includes giving away the data they dutifully
memorized. There's a lot of research on this topic, called "adversarial machine
learning"… and it's fair to say that the attackers are winning against the
defenders by a comfortable margin.
Will this change in the future? Maybe, but I'm not holding my breath. To really
secure a thing against clever adversaries, we first have to understand how the
thing works. We do not understand how AI models work. Nothing seems to indicate
that we will figure it out in the near future.
3. Ad hoc protections don't work
There are a bunch of naive things you can do to try and avoid problems 1 and 2. You
can remove obvious identifiers in your training data. You can deduplicate the
input data. You can use regularization during training. You can apply
alignment techniques after the fact to try and teach your model to not do bad
things. You can tweak your prompt and tell your chatbot to pretty please don't
reidentify people like a creep. You can add a filter to your
language model to catch things that look bad before they reach users.
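To see why the filtering approach in particular is brittle, here's a toy example: a regex-based redaction step catches exactly the pattern it was written for, and nothing else.

```python
import re

# Naive output filter: redact things that look like US phone numbers.
FILTER = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def sanitize(output: str) -> str:
    return FILTER.sub("[REDACTED]", output)

print(sanitize("Her number is 555-123-4567."))
# -> "Her number is [REDACTED]."  The filter works during normal operation...

print(sanitize("Her number is five five five, one two three, 4567."))
# -> unchanged. The same memorized fact in a different surface form
#    sails straight past the "best practice" mitigation.
```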
You can list all those in a nice-looking document, give it a fancy title like
"Best practices in AI privacy", and feel really good about yourself. But at
best, these will limit the chances that something goes wrong during normal
operation, and make it marginally more difficult for attackers. The model will
still have memorized a bunch of data. It will still leak some of this data if
someone finds a clever way to extract it.
Fundamental problems don't get solved by adding layers of ad hoc mitigations.
4. Robust protections exist, though their mileage may vary
To prevent AI models from memorizing their input, we know exactly one robust
method: differential privacy (DP). But crucially, DP requires you to
precisely define what you want to protect. For example, to protect individual
people, you must know which piece of data comes from which person in your
dataset. If you have a dataset with identifiers, that's easy. If you want to use
a humongous pile of data crawled from the open Web, that's not just hard: that's
fundamentally impossible.
In practice, this means that for massive AI models, you can't really protect the
massive pile of training data. This probably doesn't matter to you: chances are,
you can't afford to train one from scratch anyway. But you may want to use
sensitive data to fine-tune them, so they can perform better on some task.
There, you may be able to use DP to mitigate the memorization risks on your
sensitive data.
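For intuition about what DP training involves, here's a from-scratch sketch of DP-SGD on a toy logistic regression: clip each per-example gradient, then add Gaussian noise so no single example can dominate the update. The hyperparameters are illustrative only; a real deployment would calibrate the noise to a target (ε, δ) budget, typically with a library like Opacus rather than hand-rolled code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))          # toy "sensitive" dataset
y = (X[:, 0] > 0).astype(float)

w = np.zeros(10)
CLIP, NOISE_MULT, LR, EPOCHS = 1.0, 1.1, 0.1, 50

for _ in range(EPOCHS):
    preds = 1 / (1 + np.exp(-X @ w))
    per_example_grads = (preds - y)[:, None] * X        # one gradient per row
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / CLIP)  # bound each
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        scale=NOISE_MULT * CLIP, size=w.shape)          # hide any one example
    w -= LR * noisy_sum / len(X)

acc = ((1 / (1 + np.exp(-X @ w)) > 0.5) == (y > 0.5)).mean()
print(f"accuracy: {acc:.2f}")  # noisy, but it still learns the toy task
```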
This still requires you to be OK with the inherent risk of the off-the-shelf
LLMs, whose privacy and compliance story boils down to "everyone else is doing
it, so it's probably fine?".
To avoid this last problem, and get robust protection, and probably get better
results… Why not train a reasonably-sized model entirely on data that you fully
understand instead?
It will likely require additional work. But it will get you higher-quality
models, with a much cleaner privacy and compliance story. Understanding your
training data better will also lead to safer models that you can debug and
improve more easily.
5. The larger the model, the worse it gets
Every privacy problem gets worse for larger models. They memorize more training
data. They do so in ways that are more difficult to predict and measure. Their
attack surface is larger. Ad hoc protections get less effective.
Larger, more complex models also make it harder to use robust privacy notions
for the entire training data. The privacy-accuracy trade-offs are steeper, the
performance costs are higher, and it typically gets more difficult to really
understand the privacy properties of the original data.
Bonus thing: AI companies are overwhelmingly dishonest
I think most privacy experts would agree with this post so far. There are
divergences of opinion when you start asking "do the benefits of AI outweigh the
risks". If you ask me, the benefits are extremely over-hyped, while the harms
(including, but not limited to, privacy risks) are very tangible and costly. But
other privacy experts I respect are more bullish on the potential of this
technology, so I don't think there's a consensus there.
AI companies, however, do not want to carefully weigh benefits against risks.
They want to sell you more AI, so they have a strong incentive to downplay the
risks, and no ethical qualms about doing so. So all these facts about privacy and AI…
they're pretty inconvenient. AI salespeople would like it a lot if
everyone — especially regulators — stayed blissfully unaware of these.
Conveniently for AI companies, things that are obvious truths to privacy experts
are not widely understood. In fact, they can be pretty counter-intuitive!
From a distance, memorization is surprising. When you train an LLM, sentences are tokenized, words are transformed into numbers, then a whole bunch of math happens. It certainly doesn't look like you copy-pasted the input anywhere.
LLMs do an impressive job at pretending to be human. It's super easy for us to anthropomorphize them, and think that if we give them good enough instructions, they'll "understand", and behave well. It can seem strange that they're so vulnerable to adversarial inputs. The attacks that work on them would never work on real people!
People really want to believe that every problem can be fixed with just a little more work, a few more patches. We're very resistant to the idea that some problem might be fundamental, and not have a solution at all.
Companies building large AI models use this to their advantage, and do not
hesitate to make statements that they clearly know to be false. Here's OpenAI
publishing statements like « memorization is a rare failure of the training
process ». This isn't an unintentional blunder, they know how this stuff works!
They're lying through their teeth, hoping that you won't notice.
Like every other point outlined in this post, this isn't actually AI-specific.
But that's a story for another day…
Additional remarks and further reading
On memorization: I recommend Katharine Jarmul's blog post series on the
topic. It goes into much more detail about this phenomenon and its causes, and
comes with a bunch of references. One thing I find pretty interesting is that
memorization may be unavoidable: some theoretical results
suggest that some learning tasks cannot be solved without memorizing some of the
input!
On privacy attacks on AI models: this paper is a famous
example of how to extract training data from language models. It also gives
figures on how much training data gets memorized. This paper is
another great example of how bad these attacks can be. Both come with lots of
great examples in the appendix.
On robust mitigations against memorization: this survey paper provides a
great overview of how to train AI models with DP. Depending on the use case,
achieving a meaningful privacy notion can be very tricky: this paper
discusses the specific complexities of natural language data, while this
paper outlines the subtleties of using a combination of public and
private data during AI training.
Acknowledgments
Thanks a ton to Alexander Knop, Amartya Sanyal, Gavin Brown, Joe Near, Marika
Swanberg, and Thomas Steinke for their excellent feedback on earlier versions of
this post.