The Wall Street Journal has an incredible story today. The National Archives museum, under Biden-appointed U.S. Archivist Colleen Shogan, has been working to reshape its narrative of American history in order to make white conservatives more comfortable. The Journal describes a pattern of efforts to shape its upcoming exhibits to better fit right-wing narratives of U.S. history. The museum has removed references to Martin Luther King Jr., Japanese internment, Native Americans, union organizers, and birth control, because presenting American history honestly would make Republicans upset.
You might think that, come the artificial general intelligence, humanity will enjoy a world of unlimited abundance and prosperity for all, beyond money, under the watchful sensors of the superintelligent AI.
But Silicon Valley bros talking up the Singularity still want to win at capitalism. So alongside OpenAI, Sam Altman’s other big project is a cryptocurrency called Worldcoin — which just rebranded to World Network, or World for short, to slightly reduce the coiner stench. [Reuters; YouTube]
Sam “partnered” Worldcoin with OpenAI earlier this year. [Bloomberg, archive]
Worldcoin investors — which include Andreessen Horowitz (a16z), Coinbase Ventures, Digital Currency Group, Sam Bankman-Fried of FTX, and LinkedIn cofounder Reid Hoffman — have put in $244 million. Investors own 13.5% of all the 10 billion worldcoins. The developers get 9.8%. [Crunchbase; white paper]
Altman gives away worldcoins free! He’s just nice like that, see. He makes sure coins only go to individuals by … collecting scans of their eyeballs.
Now, you might think Worldcoin was some sort of crypto pump-and-dump with a sideline in exploitable personal data.
What a Worldcoin is for
Altman promotes Worldcoin as a way to end poverty — and not just a way for him to collect huge amounts of biometric data. He posits a universal basic income, paid in worldcoins.
His project aims to solve a problem that doesn’t exist — “verifying your humanness” in a world dominated by AI agents. Note that this is a problem that Altman is also attempting to cause. Altman imagines AI becoming so powerful you won’t be able to tell humans from chatbots. So you can use your eyeball for sign-in. [CNN]
Worldcoin is telling the press that companies such as Reddit and Discord are working with them on iris-based sign-in. This is false — Worldcoin is just working on using those sites’ public APIs. Reddit has had to make it clear that they have no official Worldcoin integration. [Business Insider, archive; TechCrunch]
Altman and Worldcoin CEO Alex Blania had been “noodling” on the idea of a cryptocurrency since 2019. Altman unveiled the iris-scanning “Orb” in October 2021. [Twitter, archive]
The Worldcoin crypto, WLD, launched in July 2023. In exchange for an iris scan from an orb, you get 25 worldcoins! [CoinDesk, 2023]
Once your eyeball is scanned, the system creates a cryptographic key pair. The public key gets stored with your iris scan on the Worldcoin blockchain and the private key is stored in the WorldApp application on your phone. If you drop your phone in a puddle, all your worldcoins are gone.
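For the curious, here is roughly what that key-pair pattern looks like, as a minimal Python sketch of the generic scheme rather than Worldcoin's actual code. The curve, the serialization formats, and the passcode are all assumptions for illustration.

```python
# Minimal sketch of the generic key-pair pattern described above --
# not Worldcoin's actual implementation. Requires the `cryptography` package.
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ec

# The wallet app generates the pair on the phone.
private_key = ec.generate_private_key(ec.SECP256K1())
public_key = private_key.public_key()

# The public half can be published, e.g. recorded on a chain alongside
# whatever identity data the scheme keeps.
public_pem = public_key.public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
)

# The private half never leaves the device. Whoever holds it controls the
# coins tied to the public key; without it, they are unspendable.
private_pem = private_key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.BestAvailableEncryption(b"phone-passcode"),
)

print(public_pem.decode())  # safe to share
# private_pem lives only in the app's local storage -- lose the phone with no
# backup and the key, and the coins, are gone.
```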
Using Worldcoin
Free money is a powerful lure. When Worldcoin launched in July 2023, Kenyans queued up to get their eyeballs scanned.
Converting worldcoins into actual money you can spend is fraught. The WLD token trades only against the tether stablecoin and mostly on confessed criminal crypto exchange Binance. [CoinDesk, 2023; CoinGecko]
WLD tokens live on World Chain, a blockchain implemented as an Ethereum layer 2 sidechain. So WLD hooks into the rest of the crypto trading system and you can lose all your WLD pretending you’re a hotshot DeFi trader. [Cointelegraph]
Or you could take your chances with over-the-counter buyers, offering you ready cash for WLD at a discount — and who may disappear without ever paying you.
So Worldcoin is a playground for scammers: [Rest of World, 2023]
“There’s no regulation in the space, and the people receiving the free tokens don’t have enough information. What do you expect?” Evrard Otieno, a Nairobi-based crypto trader and software developer, told Rest of World. “It’s just another opportunity for traders to make some money in the market.”
In the quest for free money, brawls even broke out in Germany in March this year. Poor people were being recruited by gangs to get scanned for worldcoins — the subject would get about 100 EUR worth of WLD and their handlers would give them 50 EUR in cash for it right away. Worldcoin staff tried to block these “undesirable” users who didn’t know or care about the fabulous crypto dream, and the poor people and their handlers got angry. [DLNews]
WLD is not officially available in the US — it’s really obviously a security under US regulations. US users can still get their eyeballs scanned if they like, though they won’t get any WLD.
Make worldcoins operating an orb!
Worldcoin distributes the orbs to operators around the world under the name Tools For Humanity. It’s a sort of multi-level marketing arrangement, complete with a three-day sales conference in Dubai for orb operators. The operators get paid in tethers. [MIT Technology Review, 2022, archive]
The orb distribution enterprise had a shaky start when it turned out to be filled with crooks and corruption from top to bottom. Hundreds of people got scanned and were never paid their WLD. The operators were paid late and had onerous signup quotas. The orbs frequently failed to work.
“We didn’t want to build hardware devices,” said Blania. “We didn’t want to build a biometric device, even. It’s just the only solution we found.” Perhaps you could try not doing all of this, then. [Buzzfeed News, 2022]
In its push for more eyeballs, this year Worldcoin introduced a new version of the orb for mass production in Germany.
Worldcoin is also opening new orb scanning venues in Mexico City and Buenos Aires. Through a partnership with the delivery app Rappi, in parts of Latin America you can have a new-model orb delivered to your door on demand, “like a pizza.” You get your scans and worldcoins and the orb goes on to the next sucker. [Wired, archive]
About that biometric data
It’s entirely unclear if people submitting to the scans understand how their biometric data will be used — and how secure the data is.
Days after the 2023 launch, the Communications Authority of Kenya and the Office of the Data Protection Commission ordered Worldcoin to suspend operations while they reviewed the project’s privacy protections. [Twitter, 2023, archive]
The Kenyan police raided Worldcoin’s Nairobi warehouse on August 5, 2023 and seized the orbs. [KahawaTungu, 2023]
Data watchdogs in Britain, France, and Germany started investigating — it’s utterly unclear how any of this works with the GDPR. The Electronic Privacy Information Center in Washington urged further government scrutiny: [ICO, 2023; Reuters, 2023; EPIC, 2023]
Worldcoin’s approach creates serious privacy risks by bribing the poorest and most vulnerable people to turn over unchangeable biometrics like iris scans and facial recognition images in exchange for a small payout.
Worldcoin has already created a black market in biometric data. [Wired, 2021; Reuters, 2023; Gizmodo, 2023]
Kenya has since allowed Worldcoin to proceed. But it’s also facing scrutiny in Colombia, Hong Kong, and Argentina. Orb scans and WLD trading are not legal in Singapore. Spain and Portugal issued temporary bans over privacy concerns. [MAS; AEPD; CNPD, PDF]
Will any of this work?
The investors’ worldcoins are unlocked as of July 2024, if they want to sell up. At present, the World team appears to be making sure the token stays nicely pumped up for the insiders. [Twitter, archive]
The main issue is that there’s no real world demand, so there’s no liquidity — Worldcoin is just another minor crypto altcoin for speculators and any substantial dump will crash it.
Blania this year gave Worldcoin “maybe a 5% chance of succeeding” — though whether at saving humanity from economic destruction by the future AI or perhaps at being allowed to run at all was not clarified. [Bloomberg, archive]
Printing their own pump-and-dump magical crypto money and collecting a vast pile of exploitable biometric data will just be consolation prizes, then.
As employers increasingly use digital tools to process job applications, a new study from the University of Washington highlights the potential for significant racial and gender bias when using AI to screen resumes.
The UW researchers tested three open-source large language models (LLMs) and found they favored resumes with white-associated names 85% of the time, and female-associated names only 11% of the time. Over the 3 million job, race and gender combinations tested, Black men fared the worst, with the models preferring other candidates nearly 100% of the time.
Why do machines have such an outsized bias for picking white male job candidates? The answer is a digital take on the old adage “you are what you eat.”
“These groups have existing privileges in society that show up in training data, [the] model learns from that training data, and then either reproduces or amplifies the exact same patterns in its own decision-making tasks,” said Kyra Wilson, a doctoral student at the UW’s Information School.
Wilson conducted the research with Aylin Caliskan, a UW assistant professor in the iSchool. They presented their results last week at the AAAI/ACM Conference on Artificial Intelligence, Ethics and Society in San Jose, Calif.
The experiment used 554 resumes and 571 job descriptions taken from real-world documents.
The researchers then doctored the resumes, swapping in 120 first names generally associated with people who are male, female, Black and/or white. The jobs included were chief executive, marketing and sales manager, miscellaneous manager, human resources worker, accountant and auditor, miscellaneous engineer, secondary school teacher, designer, and miscellaneous sales and related worker.
The results demonstrated gender and race bias, said Wilson, as well as intersectional bias when gender and race are combined.
One surprising result: the technology preferred white men even for roles that employment data show are more commonly held by women, such as HR workers.
This is just the latest study to reveal troubling biases with AI models — and how to fix them is “a huge, open question,” Wilson said.
It’s difficult for researchers to probe commercial models as most are proprietary black boxes, she said. And companies don’t have to disclose patterns or biases in their results, creating a void of information around the problem.
Simply removing names from resumes won’t fix the issue because the technology can infer someone’s identity from their educational history, cities they live in, and even word choices for describing their professional experiences, Wilson said. An important part of the solution will be model developers producing training datasets that don’t contain biases in the first place.
The UW scientists focused on open-source LLMs from Salesforce, Contextual AI and Mistral. The models chosen for the study were top-performing Massive Text Embedding (MTE) models, a specific type of LLM trained to produce numerical representations of documents, allowing them to be more easily compared to each other. That’s in contrast to LLMs like ChatGPT, which are trained for generating language.
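To make the comparison step concrete, here is a rough sketch of how an embedding model scores a resume against a job description. It is not the study's exact setup: the model name, the toy resumes, and the job description are placeholders, and it assumes the sentence-transformers package is installed.

```python
# Rough sketch of embedding-based resume screening -- not the study's models.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

job_description = "Seeking an accountant with audit and financial reporting experience."
resumes = {
    "resume_a": "Certified accountant with five years of audit experience.",
    "resume_b": "Marketing manager focused on brand campaigns and social media.",
}

# Each document becomes a vector; cosine similarity between vectors serves
# as the relevance score used to rank candidates.
job_vec = model.encode(job_description, convert_to_tensor=True)
for name, text in resumes.items():
    resume_vec = model.encode(text, convert_to_tensor=True)
    score = util.cos_sim(job_vec, resume_vec).item()
    print(f"{name}: {score:.3f}")
```

Any bias baked into those vectors shows up directly in the ranking, which is what the researchers measured.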
The authors noted that numerous previous studies have investigated foundation LLMs for bias, but few have looked at MTEs in this application, “adding further novelty and importance to this study.”
Spokespeople for Salesforce and Contextual AI said the LLMs used in the UW research were not intended for this sort of application by actual employers.
The Salesforce model included in the study was released “to the open source community for research purposes only, not for use in real world production scenarios. Any models offered for production use go through rigorous testing for toxicity and bias before they’re released, and our AI offerings include guardrails and controls to protect customer data and prevent harmful outputs,” said a Salesforce spokesperson by email.
Jay Chen, vice president of marketing for Contextual AI, said the LLM used was based on technology from Mistral and is not a commercial Contextual AI product.
“That being said, we agree that bias and ethical use of AI is an important issue today, and we work with all of our customers to mitigate sources of bias in our commercial AI solutions,” Chen said by email.
Mistral did not respond to GeekWire’s request for a comment.
While the prevalence of bias in different software solutions for screening resumes is not known, some elected leaders are taking initial steps to help address the issue.
In a move to provide more comprehensive safeguards against discrimination, California passed a state law making intersectionality a protected characteristic, in addition to identities such as race and gender alone. The rule is not specific to AI-related biases.
New York City has a new law requiring companies using AI hiring systems to disclose how they perform. There are exemptions, however, if humans are still involved in the process.
But in an ironic twist, that can potentially make the selections even more biased, Wilson said, as people will sometimes put more trust in a decision from technology than humans. Her next research will focus on how human decision makers are interacting with these AI systems.
One of the great dreams of AI snake oil is a machine that will violate employment laws in a deniable black-box manner. Whether it works doesn’t actually matter. So AI keeps promising phrenology machines.
Microsoft claimed its Image Analysis API could determine age, gender, and emotion. The tech was so often incorrect, as well as actively harmful, that Microsoft declared they would retire it as of June 30, 2023. [Blog]
Saying things is easier than doing them, so Microsoft didn’t actually switch it off. Ada Ada Ada, an algorithmic artist who runs pictures of herself through various AI phrenology machines, set up Microsoft Image Analysis before the switchoff date — and still had access to it as of last month. [404, archive]
You can’t tell gender by simply analyzing someone’s face. It just doesn’t work reliably. Our good friend Os Keyes has written at length on the subject. [LogicMag, 2019]
This doesn’t stop the companies for a second. Nyckel “recognizes that gender is a complex spectrum and a social construct,” but still sells digital phrenology that doesn’t work. [Nyckel]
Machine learning has a long history of being racist as well as sexist, so gender detectors consistently misclassify Oprah Winfrey, Michelle Obama, and Serena Williams. [Time, 2019; MLR, 2018, PDF]
You can’t tell gender from other body parts either. When does a nipple become female? Instagram’s AI has some sort of answer! [404, archive]
Digital civil rights nonprofit Access Now has a petition to ban alleged gender recognition software in the regulatory framework for AI. [Access Now]
Tech’s favorite party trick is promoting programmers into leadership roles with zero transition coaching, or even a briefing on what the role entails. The programmer accepts the promotion because…I mean, of course you’d accept a promotion. Then, they quickly find themselves in over their heads.
In my experience, when programmers become trial-by-fire managers, at least they realize they don’t know how to do their jobs yet. Technical leadership—tech lead roles, principal eng roles, and even the dreaded “player-coach” role—those sneak up on people. A lot of times there’s still programming involved, so folks feel prepared. Their experience has exposed them to technical decisions and it got them promoted, so the way they do it is probably fine. Right?
The thing is, certain decision-making pitfalls have limited negative impact at the “line programmer” level. They even appear in discussions of tech culture as lovable idiosyncrasies common among the gearheads. They become less lovable as the gearheads get more power, though, and those pitfalls produce a larger impact crater. Three in particular come up repeatedly and cause projects to falter or fail. Let’s go over them.
Pitfall #1: Assuming Context
What it is: Making decisions according to common practice, or commonly praised practice, without understanding why those practices are common/praised and whether they apply to this specific project.
Negative Impacts:
Inability to precisely answer questions from the team or from leadership about why we’re doing this (“it’s a best practice” is not an answer)
Inappropriate solutions that fail to solve context-specific problems
Inappropriate solutions that produce negative externalities and then turn the whole team against the use of that practice. A common example here is automated tests, written at too low a level of abstraction to precisely accomplish the goal of testing, then producing spurious failures on CI or making it a chore to change implementation details such that the team sours on automated testing.
The scenic explanation:
Let’s take a look at a truism that echoes throughout the hallway track at programming conferences galore:

“Generally, it’s better to optimize code for legibility than for speed.”
Is this true? Do you agree with this statement?
Notice the operative word there: “generally”. What do we mean by generally?
“Generally” means “in most cases,” and when we say things like this, we’re often skipping a discussion of which cases we’re talking about.
Generally, it’s better to optimize our code for legibility than for speed. Maybe that’s true when we’re writing end user client applications. Maybe that’s true when we’re writing open source applications in a distributed team whose communication mostly happens asynchronously. In this situation it’s common for one developer to need to understand and maintain the code that another developer wrote.
Several years ago, I started implementing compilers and interpreters. For the first one I was following the book Crafting Interpreters by Bob Nystrom. As I went through the beginning of the book, during the lexing portion where we identify tokens in the text of the source code, I came across a giant case statement that switched on all the tokens. And I decided that I could do a better job than that of expressing the hierarchy of tokens in my compiler. So I refactored that code into a series of conditionals, and I thought it read so cleanly, and I felt so smart. And I reached out to Bob and I asked him, out of curiosity, why is it done with a case statement, when other solutions might be more legible?
And he said he agreed that the tiered conditionals added legibility, but the truth is that when you are writing a programming language, it’s extremely important that the thing be fast. Developers don’t have to worry about the speed of looping over fourteen keys and values in a Ruby app, or a Python app, or a Java app, because the people who wrote the compilers for those languages made them fast enough to afford people writing in the language the luxury of optimizing for legibility. If they didn’t, no one would use the language.
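To make the tradeoff concrete, here is a toy sketch in Python rather than the book's Java or C. The token names and characters are made up; the point is only the shape of the two approaches.

```python
# Toy illustration of the tradeoff discussed above -- not code from
# Crafting Interpreters. Two ways to classify a single-character token.
# Requires Python 3.10+ for match/case.

def token_type_flat(ch: str) -> str:
    # The "giant case statement" shape: one flat branch per character,
    # which the compiler writer can keep very fast.
    match ch:
        case "(": return "LEFT_PAREN"
        case ")": return "RIGHT_PAREN"
        case "+": return "PLUS"
        case "-": return "MINUS"
        case "*": return "STAR"
        case _:   return "UNKNOWN"

def token_type_tiered(ch: str) -> str:
    # Tiered conditionals that express a hierarchy (grouping vs. arithmetic),
    # arguably more legible -- but the extra membership tests run on every
    # character of every source file the lexer ever sees.
    if ch in "()":
        return "LEFT_PAREN" if ch == "(" else "RIGHT_PAREN"
    if ch in "+-*":
        return {"+": "PLUS", "-": "MINUS", "*": "STAR"}[ch]
    return "UNKNOWN"
```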
Let’s return to the question: what does “generally” mean? Maybe “in most cases” is good enough for making decisions as line programmers. But once we’re in leadership positions, we’re responsible for knowing which cases are (and aren’t) the cases that respond well to a practice we’re considering.
How I think technical leaders should approach this:
Know what specific cases benefit from a candidate solution, so you can determine whether your case matches the profile for which this solution works well. A rigorous engineering leader can justify a choice with a complete account of that solution’s tradeoffs and how they apply in this specific circumstance. Such an account might look like this: “Our use case has X needs and Y vulnerabilities. This approach has A benefits and B risks. X lined up with A and Y didn’t line up with B, and that’s why I chose this.”
If you can describe the tradeoffs of the technical choices that you or your team are making, you’ll be several steps ahead of the best practice conversation.
Pitfall #2: Treating Everything as an Optimizing Metric
What it is: Making a long list of characteristics to consider while making a decision between two or more dependencies, architectures, or strategies, and then forestalling or repeatedly reopening the decision until one option outperforms all the others on all of the characteristics.
Negative Impacts:
Decisions take forever and sometimes never actually complete
Revisited decisions generate ongoing work as team members get assigned the job of switching back and forth, back and forth
Team gets a reputation for being hard to work with due to being unable to move forward on a decision, particularly when other teams depend on that decision being made
The scenic explanation:
When engineers get the opportunity to make design and—maybe especially—tooling decisions, it’s normal, easy, and common to fall into an evaluation trap that exhausts a lot of energy on the wrong things. It usually involves first making a list of all the characteristics that might matter about the outcome of this decision, and then treating every single one of those characteristics as equally important and worthy of maximizing. Because many of these characteristics have tradeoff relationships with each other, it’s empirically impossible to identify a solution that outperforms all the alternatives on every single metric. So teams hem and haw about the decision forever, never arrive at a solution they’re happy with, and always have fodder to denigrate whichever engineer made the last call, rather than understanding the role of context and tradeoffs in how to make this decision well.
How I think technical leaders should approach this:
To explain this, I want to introduce you to the idea of optimizing and satisficing metrics.
Optimizing metrics are the ones for which more is always better: the ones for which, no matter how much we already have, more will always improve our product outcome. Code legibility might be a good example of this in the specific context of end user applications developed by a team of developers—and particularly one that sometimes experiences churn, but I think that’s all of them. In a situation like this, we almost never seem to reach a point where additional code legibility efforts feel superfluous, so deliberately improving code’s legibility almost always benefits our maintenance efforts.
Satisficing metrics are the ones for which we have a clear idea of how good is good enough. More of this does not improve our product outcome. A great example is performance: your app’s animations only look smoother up to the frame rate that the human eye can visually process (which, by the way, is about 60 Hz). So that’s a pretty good satisficing metric for animation quality.
When you’re making decisions, narrow down the list of optimizing metrics to as few as possible—zero if you can, one or two realistically. Then, establish the threshold at which your options would satisfice on all of the other metrics. You can use this to narrow down the options to those that meet the satisficing metrics, and after that point, ignore those metrics and focus exclusively on the optimizing ones.
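As a sketch of what that can look like in practice, here is a minimal example with hypothetical options, metrics, and thresholds. Filter everything through the satisficing thresholds first, then rank whatever survives on the single optimizing metric.

```python
# Hypothetical options and metrics, purely for illustration.
options = [
    {"name": "tool_a", "docs_quality": 7, "p95_latency_ms": 40, "team_familiarity": 9},
    {"name": "tool_b", "docs_quality": 9, "p95_latency_ms": 180, "team_familiarity": 6},
    {"name": "tool_c", "docs_quality": 6, "p95_latency_ms": 55, "team_familiarity": 7},
]

# Satisficing metrics: good enough is good enough.
satisficing = {
    "docs_quality": lambda v: v >= 6,      # readable docs exist
    "p95_latency_ms": lambda v: v <= 100,  # fast enough for our users
}

# The one optimizing metric: the thing we actually want as much of as possible.
optimizing = "team_familiarity"

viable = [o for o in options
          if all(ok(o[metric]) for metric, ok in satisficing.items())]
best = max(viable, key=lambda o: o[optimizing])
print(best["name"])  # tool_a under these made-up numbers
```

The point is not the code; it is that only one number gets maximized, and everything else merely has to clear a bar.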
Pitfall #3: Manufacturing Emergencies
What it is: Making decisions that result in your team constantly or frequently facing short-term time pressure. This happens when leaders:
Overstate the negative impact to constituents of not meeting a short-term (<24 work hours) deadline
Generate consequences for their team members on top of the negative impacts to constituents, to artificially exaggerate the direness of the situation for their team
Frequently subject their team members to short deadlines caused by foreseeable problems that could have been avoided had leadership done the appropriate amount of preparation and planning
Negative Impacts:
Deteriorating systems as teams spend time in “emergency mode,” where they’re incentivized to write patch jobs and other poorly considered code solutions
Deteriorating morale as team members tire of the extra time and extra stress applied to fulfilling their duties
Deteriorating trust in leadership as team members figure out that most of these “emergencies” either weren’t really emergent or could have been prevented by people who failed them as leaders
The scenic explanation:
Before I became a programmer, I spent ten years as a coxswain: the person who sits in the back of the boat, steers, and yells at rowers. I started driving boats at age fourteen and I kept it up for ten years.1 I started my first blog about coaching coxswains. During my junior year in college, my collegiate coach decided I “shine under pressure.”
By then I had been coxing for seven years and systematically journaling about it for five years. I had seen (and journaled about how to handle) a lot of adverse circumstances, so I had a broad library of strategies to choose from under those circumstances. I developed a track record for responding well when the shit hit the fan.
But then, my coach decided that the way to take advantage of a coxswain who can “shine under pressure” was to deliberately put them in stressful situations to get them to “shine” as much as possible.
Dear reader, it might be the case that some people handle pressure better than others. It is almost never the case that someone under pressure outperforms a version of themselves who is not under pressure. If you want to experience people’s best work, it is often in your best interest to relieve pressure—not artificially apply it.
Clearly, that’s not always possible. Some fields have real emergencies: medicine, crisis response, catching violent criminals. Software engineering, dear reader, is almost never one of those fields. The weird exception you’re racking your brain for right now, though perhaps existent, does not make software engineering, by and large, one of those fields.
Almost every urgent situation that a technical contributor has ever faced, or will ever face, is manufactured. The time pressure either could have been prevented, or the consequences of this task not being handled right now, at three in the morning or whatever, don’t stretch beyond “the page won’t load for a couple of night owls, and some manager will be upset.”
For some eldritch reason, tech culture has come to romanticize the manufactured emergency instead of denigrating it for the leadership failure that it almost always is.
It’s the difference between—as an example—a head chef who runs a kitchen well and calmly, such that his staff are all also calm and able to do their jobs, vs. a head chef who also has a reputation for keeping a tight kitchen, but does so by screaming and making his staff’s job a gauntlet. Both chefs can achieve fame and fortune. People will only want to work for the screaming one so they can say on their resume that they survived. The calm one is the one someone would actually want to work for.
Under the calm one, the work you do makes you a better practitioner. Under the freaking out one, the work you do makes you stressed, and to the extent that you become a better practitioner, it’s mostly because of investments that you make outside your job: independent study, practicing your egg flip at home, et cetera, because improvement requires challenge, which begets failing sometimes, and that’s a lot costlier under the freakout chef than under the calm chef.
The reason both head chefs produce good staff chefs is that the first one teaches them to excel, and the second one selects for people who respond to abuse by teaching themselves to excel. There are good cooks who can become great chefs under the first one, and could not do so under the second one. Not nearly as true the other way around: people who survive the second chef were making it to the top regardless. The supply of people who want to be chefs, relative to the demand for chefs in prestigious kitchens, is much, much, much higher than the supply of capable software engineers relative to the demand for them. So technical leaders who cosplay freakout chefs see much worse results than actual freakout chefs do.
How I think technical leaders should approach this:
I think you have two types of manufactured emergencies to consider.
The first type of manufactured emergency is the type you yourself manufacture. You can manufacture an emergency either by failing to anticipate likely future events, or by treating situations as more dire than they actually are. Some examples:
The team has to hack together a time bug workaround on short notice because you did not anticipate that this time bug was about to happen.
Your team has to deploy on a specific cloud computing platform on short notice because you courted a client who insisted on using this platform, did not manage that client’s expectations on time-to-production on this platform, and did not give your team advance warning that you would acquiesce to this client’s platform and timeline expectations.
You assign your team an on-call rotation for a new service that enables a bookmarking function since on-call is a “best practice” for production software. The unstable service regularly pages people at 3:00 AM, who then get up to fix bookmarks—not a service that will cause anyone grave harm if it’s down for a few hours.
Every once in a while as a leader, you’ll screw up and your team will have to make up for it. It happens; you don’t need to pillory yourself over it. But if your team regularly ends up in rush job situations…that’s probably you. It’s time to consider what signals you’re missing, what communications you’re failing to provide, or what minor blips you’re treating as major problems in front of your team.
The second type of manufactured emergency is the type created and then thrust upon you by leadership above you. In my opinion, these suck more than the first type because you have a lot less control over whether they keep happening.
When you, as a technical leader, receive pressure downwards from executives above you, you have three options:
Try to block that pressure from affecting your team
Pass that pressure directly through yourself and onto your team
Amplify that pressure downward, so it’s worse on your team than it was on you.
Option 1 is by far the best one for your team’s performance, and it’s also by far the most difficult. To lead like this, you need not only the skills to manage upward, but also the skills to understand the forces undergirding the pressure—often in situations where this information was withheld from you. Then, you need to develop strategies and success criteria that your team can use to execute without needing to worry about the fact that a VP yelled at you or whatever. I have seen exactly two managers, ever, consistently succeed at Option 1. They both had an enormous amount of experience in leadership as well as almost zero fear about losing their jobs.2 That doesn’t describe most technical leaders.
So I think a reasonable goal for a majority of technical leaders, apropos of these three options, would be to aim between options 1 and 2. Work on transparent communication with your team about what has happened, what you think leadership above you needs to see from your team as a result, and how you think we can get there with as little kerfuffle as possible. Offer what support you can, and work on helping leadership above you with their ability to anticipate eventualities and accurately assess how big a deal something is. I know that this option is also quite hard, but I have a lot of faith in you.
I think each of these three decision-making pitfalls could merit its own treatment about specific tactics. For now, I’m hopeful that seeing them in a short list, with specific names attached to each one, makes them easier for technical leaders to spot. Once you can spot these pitfalls, you have a much better chance of avoiding them in the future and opting for an alternate decision-making strategy with a more favorable distribution of possible results.
Then I got too heavy. To be honest, before that I had become too heavy. My coach compared me to another coxswain, in front of the entire boathouse of all three collegiate rowing teams, by holding a small dumbbell above his head. ↩︎
One of these people had zero fear about losing his job because every executive at the company knew the company could not continue to function without him. The other one had zero fear about losing his job because he possessed enough confidence in his savings strategy, his network, and his interview ability to say “if I get fired, I’ll be fine.” One of the reasons I coach my mentees on financial management and relationship building is that, even if you want to have an ethical and impactful career, often your ability to do so depends on your ability to do and say things that feel like they could, in a worst case scenario, jeopardize your role at the organization you’re doing/saying them to. It’s very hard to hold someone accountable when you depend on them to survive. ↩︎
Anyone familiar with HR practices probably knows of the decades of studies showing that résumés with Black- and/or female-presenting names at the top get fewer callbacks and interviews than those with white- and/or male-presenting names—even if the rest of the résumé is identical. A new study shows those same kinds of biases also show up when large language models are used to evaluate résumés instead of humans.
In a new paper published during last month's AAAI/ACM Conference on AI, Ethics and Society, two University of Washington researchers ran hundreds of publicly available résumés and job descriptions through three different Massive Text Embedding (MTE) models. These models—based on the Mistral-7B LLM—had each been fine-tuned with slightly different sets of data to improve on the base LLM's abilities in "representational tasks including document retrieval, classification, and clustering," according to the researchers, and had achieved "state-of-the-art performance" in the MTEB benchmark.
Rather than asking for precise term matches from the job description or evaluating via a prompt (e.g., "does this résumé fit the job description?"), the researchers used the MTEs to generate embedded relevance scores for each résumé and job description pairing. To measure potential bias, the résumés were first run through the MTEs without any names (to check for reliability) and were then run again with various names that achieved high racial and gender "distinctiveness scores" based on their actual use across groups in the general population. The top 10 percent of résumés that the MTEs judged as most similar for each job description were then analyzed to see if the names for any race or gender groups were chosen at higher or lower rates than expected.
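Schematically, that analysis pattern looks something like the sketch below, with made-up names and a placeholder scoring function standing in for the embedding models.

```python
# Schematic of the top-decile analysis -- made-up data, placeholder scoring.
import random

def relevance_score(resume_text: str, job_text: str) -> float:
    # Stand-in for the embedding similarity the study computes between
    # each resume and job description.
    return random.random()

names_by_group = {
    "white_male": [f"WM name {i}" for i in range(10)],
    "black_female": [f"BF name {i}" for i in range(10)],
}
base_resume = "Experienced accountant with audit and reporting background."
job = "Seeking an accountant with audit experience."

scored = []
for group, names in names_by_group.items():
    for name in names:
        resume = f"{name}\n{base_resume}"  # identical resume, only the name changes
        scored.append((group, relevance_score(resume, job)))

# Take the top 10 percent of scores and compare each group's share of that
# slice against its share of the candidate pool.
scored.sort(key=lambda pair: pair[1], reverse=True)
top = scored[: max(1, len(scored) // 10)]
for group in names_by_group:
    pool_share = sum(g == group for g, _ in scored) / len(scored)
    top_share = sum(g == group for g, _ in top) / len(top)
    print(group, f"pool {pool_share:.0%} vs top decile {top_share:.0%}")
```

With a real scoring model and real résumés, a systematic gap between those two shares is the bias the paper reports.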
A consistent pattern
Across more than three million résumé and job description comparisons, some pretty clear biases appeared. In all three MTE models, white names were preferred in a full 85.1 percent of the conducted tests, compared to Black names being preferred in just 8.6 percent (the remainder showed score differences close enough to zero to be judged insignificant). When it came to gendered names, the male name was preferred in 51.9 percent of tests, compared to 11.1 percent where the female name was preferred. The results could be even clearer in "intersectional" comparisons involving both race and gender; Black male names were preferred to white male names in "0% of bias tests," the researchers wrote.
These trends were consistent across job descriptions, regardless of any societal patterns for the gender and/or racial split of that job in the real world. That suggests to the researchers that this kind of bias is "a consequence of default model preferences rather than occupational patterns learned during training." The models seem to treat "masculine and White concepts... as the 'default' value... with other identities diverging from this rather than a set of equally distinct alternatives," according to the researchers.
The preference shown by these models toward or against any one group in each test was often quite small. The measured "percentage difference in screening advantage" was around 5 percent or lower in the vast majority of comparisons, which is smaller than the differential preference rates shown by many human recruiters in other studies. Still, the overwhelming consistency of the MTEs' preference toward white and/or male names across the tests adds up across many different job descriptions and roles.
The results in this controlled study might also not match how recruiters use AI tools in the real world. A Salesforce spokesperson told Geekwire that "any models offered for production use go through rigorous testing for toxicity and bias before they’re released, and our AI offerings include guardrails and controls to protect customer data and prevent harmful outputs."
We’ve literally been pointing this out since the early days of machine learning, at least the mid-2000s. Well before “AI” was a thing (the most recent time around).