I Blog Like a Girl!

I discovered The Gender Genie via LaughingMeme, which led me to Moshe Koppel and Shlomo Argamon’s algorithm, described in Nature and the New York Times Magazine. The Koppel-Argamon algorithm analyzes a text and guesses the author’s gender.    (90)

The algorithm was very simple, so I implemented it as a Perl module — Lingua::EN::Gender. I just registered for a PAUSE ID, and will upload Lingua::EN::Gender to CPAN as soon as my registration is confirmed.    (91)
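For a rough idea of how a classifier like this can work, here is a minimal Python sketch. It is not the Lingua::EN::Gender module itself. It assumes the approach boils down to a weighted keyword tally: each keyword leans "female" or "male" with some weight, and the side with the higher total wins. The word lists, the weights, and the function names `score` and `guess_gender` are hypothetical placeholders for illustration, not the published values.

```python
import re
from collections import Counter

# Hypothetical placeholder weights, for illustration only.
# These are NOT the values published by Koppel and Argamon.
FEMALE_WEIGHTS = {"she": 3, "her": 3, "with": 2, "and": 1}
MALE_WEIGHTS = {"the": 2, "a": 1, "it": 2, "as": 3}


def score(text, weights):
    """Sum weight * occurrence count for every keyword found in text."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return sum(weight * counts[word] for word, weight in weights.items())


def guess_gender(text):
    """Return 'female', 'male', or 'unknown' when the two scores tie."""
    female = score(text, FEMALE_WEIGHTS)
    male = score(text, MALE_WEIGHTS)
    if female == male:
        return "unknown"
    return "female" if female > male else "male"


if __name__ == "__main__":
    print(guess_gender("She walked with her dog and sang."))
```

With real weights, the same two functions could be run over a directory of blog entries to tally how many fall on each side.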

For test data, I went to Project Gutenberg and chose the first chapters of Charles Dickens’s A Tale of Two Cities and Herman Melville’s Moby Dick, and the second chapter of Charlotte Brontë’s Jane Eyre. The module correctly guessed the gender of each author.    (92)

Then, I decided to have a little fun. I tried the module on the preface of George W. Sands’s Mazelli, and Other Poems. Sands, of course, was the pen name of Aurore Dupin, a woman. Sure enough, the module correctly identified the author as a woman.    (93)

I wanted to try the reverse test (a man writing as a woman), so I searched for Mark Twain’s short story, “Eve’s Diary.” Sadly, Project Gutenberg did not have this. However, it did have Twain’s “Extracts from Adam’s Diary,” so for kicks, I tried that. The module incorrectly guessed that this was authored by a woman!    (94)

Now that my euphoria was officially dead, I decided to test my old blog entries. The module claimed that 23 out of my 34 entries were written by females! Scientific proof that I’m a heckuva sensitive fella.    (95)

Lest you think I have too much time on my hands, here’s the work rationale for playing with this algorithm. Last January, I met Freada Kapor Klein, who is interested in diversity in the workplace, and who heads up the Level Playing Field Institute. She encouraged me to consider diversity in online communities as an area of research. One of the difficulties with this is that the only way to gather demographic data such as race or gender is via surveys. A tool that could accurately determine gender based on the author’s prose would be very handy.    (96)

More on Marc Smith and Joshua Tyler

Going through some old notes, I found some other references to both Marc Smith and Joshua Tyler. Marc wrote about Netscan in “Tools for Navigating Large Social Cyberspaces” (Communications of the ACM, April 2002, pp. 51-55). Joshua has a paper from his time at Stanford entitled “When Will You Read And Reply to My Email? A Study of Rhythms and Temporal Patterns in Email Use.”    (8Z)

HP and Microsoft Study Online Communities

Two articles in recent days highlighted social scientists at Hewlett-Packard (Joshua Tyler) and Microsoft (Marc Smith). Tyler, 25, studies the rhythms of composing and answering e-mails, work that he started while pursuing a master’s degree at Stanford. Smith has been studying USENET newsgroups, and his team at Microsoft has developed software called Netscan.    (8Y)

Mailing List Etiquette and Experimentation

I’ve been an e-mail user for over 10 years, and a mailing list and USENET user for just about as long, so I have strong beliefs about proper mailing list etiquette. That puts me in an interesting position as a participant in the Blue Oxen Collaboratories. On the one hand, these collaboratories are supposed to be shining examples of high-performance collaboration. On the other hand, they’re also supposed to be testbeds for experimentation and coevolution.    (8K)

Sometimes, people use the lists in ways that conflict with my inner sense of etiquette. However, etiquette is a form of social constraint, and if we forget why we find these constraints valuable, our sense of etiquette can impede collaboration.    (8L)

The collaboratories are a perfect place to have metadiscussions about this sort of behavior as it happens, but unfortunately, it’s hard for me to participate in or initiate those discussions. Because of my position at Blue Oxen Associates, my commentary can be perceived as law rather than opinion, and I don’t want that to be the case.    (8M)

That, of course, is the reason for this blog — for me to WhineInPrivate in public. Which leads me to today’s topic.    (8N)

Cross-Posting    (8O)

The impetus for this entry was a discussion this morning with Andrius Kulikauskas, who called me all the way from Lithuania. Andrius founded the Minciu Sodas laboratory, which is similar in spirit to Blue Oxen Associates, and he actively participates in our collaboratories.    (8P)

One thing that Andrius does often is cross-post across different forums and mailing lists. I’m not a big fan of cross-posting, but I think there are times when it’s appropriate. The problems are:    (8Q)

  • Posting to a list is often restricted to subscribers, largely as a way to avoid spam. As a result, people will not necessarily see responses to cross-posts, which can lead to stilted and confusing discussion.    (8R)
  • There is a force at work within effective communities: Know Your Audience. People behave differently depending on their audience, as well they should. You may know the audiences of the lists to which you are cross-posting, but the different members of each list may not, and that will affect how or whether they respond.    (8S)

Andrius and I discussed this a bit over the phone. One of his rationales for cross-posting is that it builds awareness of other communities, and that it encourages interconnectedness. I definitely believe in the former, but I’m not so sure about the latter. Simply knowing that a community exists certainly enables interconnectedness at some level, but cross-posting could also discourage interconnectedness. If there is no Shared Language on the different lists, cross-posts can unintentionally lead to further balkanization of communities.    (8T)

I’m not a big fan of cross-posting, but I tolerate it, and for good reason. When Andrius did it, it forced me to think very hard about why I disliked it. That helped me understand the reason for my feelings, and it also helped me think through some half-baked ideas. For example, when you consider the benefits of cross-posting, it’s clear that the technical balkanization of mailing lists caused by restricting participation to subscribers can be a very bad thing. One advantage of USENET newsgroups and web-based forums is that you don’t have this balkanization. On the other hand, these forums are susceptible to spam. I’m certain there is a technical solution to this problem, although I don’t know what it is yet.    (8U)

Off-Topic Postings    (8V)

A great example of where etiquette can be overly enforced is off-topic postings. I’ve noticed that on some lists, moderators are tyrannical about keeping discussion on-topic. While I understand the reasoning, being too extreme about this can do more harm than good. Online spaces are different from face-to-face spaces, and the former are more tolerant of divergent ramblings than the latter. I wrote some thoughts on this in an earlier entry.    (8W)

Another reason for tolerating some off-topic discussions is the Water Cooler pattern. Communities are more effective when people have shared experiences, and informal socializing is a great way to identify or create these experiences. Several months ago, there was a discussion about local restaurants on the San Francisco Perl User Group mailing list. This clearly had no relevance to Perl, but it was very interesting, and no one complained. In fact, the original poster asked for recommendations off-list, and several people e-mailed him asking him to post the responses on-list instead.    (8X)