Identifying toxic content can be a first step in addressing it
Trolls, haters and other ugly characters are sadly a reality on much of the internet. Their ugliness can taint social networks and sites like Reddit and Wikipedia.
But toxic content looks different depending on location, and identifying toxicity online is a first step in getting rid of it.
A team of researchers from the Institute for Software Research (ISR) in the School of Computer Science at Carnegie Mellon University recently collaborated with colleagues at Wesleyan University to take a first step in understanding toxicity on open source platforms like GitHub.
“You need to know what that toxicity looks like in order to design tools to manage it,” said Courtney Miller, a Ph.D. student in the ISR and lead author of the paper. “And managing that toxicity can lead to places that are healthier, more inclusive, more diverse and just better overall.”
To better understand what toxicity looked like in the open source community, the team first gathered toxic content. They used a toxicity and politeness detector developed for another platform to scan nearly 28 million posts on GitHub made between March and May 2020. The team also searched these posts for “code of conduct” – a phrase often invoked when reacting to toxic content – and looked for locked or deleted issues, which can also signal toxicity.
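The curation pipeline described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the real study used a toxicity detector trained for another platform, whereas the classifier below is a trivial keyword stand-in, and the post format and function names are invented for the example.

```python
# Hypothetical sketch of the curation step: flag candidate posts using
# three signals from the article - a toxicity classifier score, mentions
# of "code of conduct", and locked issues.

def toy_toxicity_score(text: str) -> float:
    """Trivial keyword stand-in for a real toxicity/politeness classifier."""
    toxic_markers = ("worst", "garbage", "useless")
    hits = sum(marker in text.lower() for marker in toxic_markers)
    return min(1.0, hits / 2)


def flag_candidates(posts: list[dict], threshold: float = 0.5) -> list[tuple]:
    """Return (post id, signals) pairs for posts any signal flags."""
    candidates = []
    for post in posts:
        signals = {
            "classifier": toy_toxicity_score(post["body"]) >= threshold,
            "code_of_conduct": "code of conduct" in post["body"].lower(),
            "locked": post.get("locked", False),
        }
        if any(signals.values()):
            candidates.append((post["id"], signals))
    return candidates


posts = [
    {"id": 1, "body": "Worst. App. Ever. Total garbage.", "locked": False},
    {"id": 2, "body": "Thanks for the quick fix!", "locked": False},
    {"id": 3, "body": "Please read our code of conduct.", "locked": True},
]
print(flag_candidates(posts))  # posts 1 and 3 are flagged
```

In the study, automated flagging like this only produced candidates; the final dataset of 100 posts came from manual review of what the signals surfaced.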
Through this curation process, the team assembled a final dataset of 100 toxic posts. They then used this data to study the nature of the toxicity. Was it insulting, entitled, arrogant, trolling or unprofessional? Was it directed at the code itself, at people, or at something else entirely?
“Toxicity is different in open source communities,” Miller said. “It is more contextual, entitled, subtle and passive-aggressive.”
Only about half of the toxic posts the team identified contained obscenities. Others came from demanding users of the software. Some came from users who post a lot of issues on GitHub but contribute little else. Comments that started out about a software's code turned personal. None of the posts helped improve the open source software or the community.
“Worst. App. Ever. Please don’t make it the worst app. Thank you,” one user wrote in a post included in the dataset.
The team noticed a unique trend in how people responded to toxicity on open source platforms. Often, the project's developer went out of their way to accommodate the user or fix the issues raised in the toxic content. This frequently led to frustration.
“They wanted to give the benefit of the doubt and find a solution,” Miller said. “But this often proved taxing.”
Reaction to the paper has been strong and positive, Miller said. Open source developers and community members were thrilled that this research was happening and that behavior they had long dealt with was finally being acknowledged.
“We’ve been hearing from developers and community members for a very long time about the unfortunate, almost ingrained toxicity in open source,” Miller said. “Open source communities are a little rough around the edges. They often have terrible diversity and retention, and it’s important that we start to address and manage the toxicity there to make them more inclusive and better places.”
Miller hopes the research will lay a foundation for more and better work in this area. Her team stopped short of building a toxicity detector for the open source community, but the groundwork has been laid.
“There’s so much work to do in this space,” Miller said. “I really hope people see this, develop it, and keep pushing things forward.”
Joining Miller on the work were Daniel Klug, a systems scientist in the ISR; ISR faculty members Bogdan Vasilescu and Christian Kästner; and Sophie Cohen of Wesleyan University. The team’s paper, “‘Did You Miss My Comment or What?’ Understanding Toxicity in Open Source Discussions,” was presented at the ACM/IEEE International Conference on Software Engineering last month in Pittsburgh, where it won a Distinguished Paper award.