I discovered The Gender Genie from LaughingMeme, which led me to Moshe Koppel and Shlomo Argamon’s algorithm, described in Nature and the New York Times Magazine. The Koppel-Argamon algorithm analyzes the text and guesses the author’s gender.
The algorithm was very simple, so I implemented it as a Perl module — Lingua::EN::Gender. I just registered for a PAUSE ID, and will upload Lingua::EN::Gender to CPAN as soon as my registration is confirmed.
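The module’s internals aren’t reproduced here, but the published technique amounts to summing per-word weights for two keyword lists and comparing the totals. Below is a minimal sketch in Python (rather than the module’s Perl); the word lists and weights are illustrative placeholders, not the actual Koppel-Argamon values:

```python
import re

# Placeholder keyword weights -- stand-ins for the published lists,
# chosen only to illustrate the shape of the algorithm.
FEMININE_WEIGHTS = {"with": 52, "if": 47, "not": 27, "her": 9, "she": 6, "was": 1}
MASCULINE_WEIGHTS = {"around": 42, "what": 35, "is": 8, "the": 7, "a": 6, "it": 6}

def classify(text):
    """Sum keyword weights over the text; the higher total wins."""
    words = re.findall(r"[a-z']+", text.lower())
    fem = sum(FEMININE_WEIGHTS.get(w, 0) for w in words)
    masc = sum(MASCULINE_WEIGHTS.get(w, 0) for w in words)
    return "female" if fem > masc else "male"
```

For example, `classify("She was with her sister")` matches only feminine keywords and returns `"female"`, while `classify("The cat sat around it")` matches only masculine ones and returns `"male"`.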
I went to Project Gutenberg for some test data, and chose the first chapters of Charles Dickens’s A Tale of Two Cities and Herman Melville’s Moby Dick, and the second chapter of Charlotte Brontë’s Jane Eyre. The module correctly guessed the genders of all three authors.
Then, I decided to have a little fun. I tried the module on the preface of George W. Sands’s Mazelli, and Other Poems. Sands, of course, was the pen name of Aurore Dupin, a woman. Sure enough, the module correctly identified the author as a woman.
I wanted to try the reverse test (a man writing as a woman), so I searched for Mark Twain’s short story, “Eve’s Diary.” Sadly, Project Gutenberg did not have this. However, it did have Twain’s “Extracts from Adam’s Diary,” so for kicks, I tried that. The module incorrectly guessed that this was authored by a woman!
Now that my euphoria was officially dead, I decided to test my old blog entries. The module claimed that 23 out of my 34 entries were written by a female! Scientific proof that I’m a heckuva sensitive fella.
Lest you think I have too much time on my hands, here’s the work rationale for playing with this algorithm. Last January, I met Freada Kapor Klein, who is interested in diversity in the workplace, and who heads up the Level Playing Field Institute. She encouraged me to consider diversity in online communities as an area of research. One of the difficulties with this is that the only way to gather demographic data such as race or gender is via surveys. A tool that could accurately determine gender based on the author’s prose would be very handy.
I only tested two posts, but both registered as female.
It’s especially interesting since you calculate numbers incorrectly. Could that be it? I also noticed that if you went the other way, “100,000” would be counted incorrectly as well.
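One plausible source of the miscount this comment describes is naive tokenization: a word-character tokenizer splits a number like “100,000” at the comma, turning one token into two. A small sketch (in Python, as an illustration rather than the module’s Perl):

```python
import re

text = "The fund raised 100,000 dollars."

# A plain word-character tokenizer splits on the comma,
# so "100,000" becomes two tokens instead of one.
naive = re.findall(r"\w+", text)

# A pattern that permits internal commas keeps the number whole.
better = re.findall(r"\w+(?:,\w+)*", text)

print(naive)   # ['The', 'fund', 'raised', '100', '000', 'dollars']
print(better)  # ['The', 'fund', 'raised', '100,000', 'dollars']
```

Whether this is the actual bug in the site’s counter is a guess; it is simply the most common way numbers with separators get miscounted.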
I tried out your algorithm on my own work and discovered that out of 250 articles, 239 were classified as “written by a female” and 11 as “written by a male.” So much for accuracy.
I’m thinking about re-implementing the algorithm, but the abstract claims that the technique “can also be used to distinguish fiction from non-fiction,” so the algorithm as published in the NYT is probably inaccurate or incomplete. Time to head to the university library and read the actual algorithm.
went back and read the nytimes article laying out the algorithm. what one might extrapolate from the algorithm is that men write more formally, e.g. using articles (the, a), and possibly reference more inanimate or abstract objects (the table, the woman, a cat) (the use of “it”), and women write more about relationships, thus the use of possessives and “with”. so it seems to me.
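The features this commenter singles out are easy to tally directly. A minimal sketch counting articles, “it”, and the relational words the comment mentions; the exact word sets are my own guess at what the commenter means, not anything from the published algorithm:

```python
import re
from collections import Counter

ARTICLES = {"the", "a", "an"}
# Possessives plus "with", per the comment above -- an illustrative set.
RELATIONAL = {"with", "her", "hers", "my", "your"}

def feature_counts(text):
    """Tally the function-word features the comment describes."""
    words = re.findall(r"[a-z']+", text.lower())
    c = Counter(words)
    return {
        "articles": sum(c[w] for w in ARTICLES),
        "it": c["it"],
        "relational": sum(c[w] for w in RELATIONAL),
    }
```

For example, `feature_counts("The cat sat on a mat with her toy")` returns `{"articles": 2, "it": 0, "relational": 2}`.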