
I'm interested in natural language processing.

Where could I find a very large collection of natural language text ("corpus" in NLP speak)?

Hmm.  Oh, right.  Right here.

What kind of things can you do with such a corpus?

Lots!  For a first exercise, though, I really do love the "text generators" based on word and phrase frequencies.

So, I counted every bigram that was used 10 or more times in Daily Kos diaries published in 2012.

A bigram is just a two-word phrase, like, for instance, "Daily Kos".

Once you know how often a word is likely to occur in a Daily Kos diary, and how often that word is followed by each other word, you can generate "Daily Kos Text".
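The counting step can be sketched roughly like this. It is only an illustration, not the code actually used for the diary: the function name `count_bigrams`, the tokenizer regex, and the toy sample text are all assumptions, and this version lets bigrams run across sentence boundaries for simplicity.

```python
import re
from collections import Counter

def count_bigrams(text, min_count=10):
    """Return (word_counts, bigram_counts), keeping bigrams seen at least min_count times."""
    # Assumed tokenizer: lowercase, keep runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    word_counts = Counter(words)
    # Pair each word with the word that follows it.
    all_bigrams = Counter(zip(words, words[1:]))
    # Drop the rare bigrams (the diary used a cutoff of 10).
    bigram_counts = Counter(
        {bigram: n for bigram, n in all_bigrams.items() if n >= min_count}
    )
    return word_counts, bigram_counts

# Toy input, with a cutoff of 2 so that something survives the filter.
sample = "daily kos diary. daily kos text. daily bucket."
word_counts, bigram_counts = count_bigrams(sample, min_count=2)
print(bigram_counts)
```

With the real corpus, `text` would be the full text of the 2012 diaries and the cutoff would stay at 10.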

My two personal favorites.
- Media attention deficit.
- House oversight hearing aid kit.

Fun!  Jump on over for more.

The fun is otherwise known as

"A second order Markov approximation to Daily Kos Diaries, 2012"

- I decide how many words of dkos-like text I want.
- I pick a first word randomly, based on its relative frequency.
- I call "next word" that many times and stash the result in a file.
- I manually prune (some lines are gibberish) and embellish (lightly).

Here's the "next word" method.
Assume the word given is a first word (x).

Try to pick a bigram second word (y) randomly, based on its relative frequency following x.
If that works, return the word.

If no words are known to follow x in a bigram, pick a new first word (x) as before, and return an end of sentence flag and the word.
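Here is a minimal sketch of that "next word" method, under stated assumptions: the names `next_word`, `generate`, `pick_weighted` and `build_followers`, the toy counts, and the choice to make the total output exactly `n_words` long are all mine, not taken from the original program.

```python
import random

def build_followers(bigram_counts):
    """Map each first word x to a dict of {second word y: count of the bigram (x, y)}."""
    followers = {}
    for (x, y), n in bigram_counts.items():
        followers.setdefault(x, {})[y] = n
    return followers

def pick_weighted(counts, rng):
    """Pick one key at random, with probability proportional to its count."""
    words = list(counts)
    return rng.choices(words, weights=[counts[w] for w in words], k=1)[0]

def next_word(x, word_counts, followers, rng):
    """Return (end_of_sentence, word) for the word that comes after x."""
    if x in followers:
        # Some bigram starts with x: pick y by its relative frequency after x.
        return False, pick_weighted(followers[x], rng)
    # Nothing is known to follow x: flag end of sentence and pick a fresh first word.
    return True, pick_weighted(word_counts, rng)

def generate(n_words, word_counts, bigram_counts, seed=None):
    """Generate n_words words in total, returned as a list of sentences."""
    rng = random.Random(seed)
    followers = build_followers(bigram_counts)
    word = pick_weighted(word_counts, rng)
    sentences, current = [], [word]
    for _ in range(n_words - 1):
        ended, word = next_word(word, word_counts, followers, rng)
        if ended:
            sentences.append(" ".join(current))
            current = []
        current.append(word)
    sentences.append(" ".join(current))
    return sentences

# Toy counts, made up for illustration only.
word_counts = {"daily": 3, "kos": 2, "diary": 1, "hearing": 1, "aid": 1}
bigram_counts = {("daily", "kos"): 2, ("kos", "diary"): 1, ("hearing", "aid"): 1}
sentences = generate(20, word_counts, bigram_counts, seed=0)
for sentence in sentences:
    print(sentence)
```

Because a sentence only ends when the current word has no known follower, every adjacent pair of words inside a generated sentence is a bigram from the table.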

Here are some sample sentences from a run of 2000 words.  I culled around 90% of the output; the examples that survived contain only 223 words.  Most of them sound more like headlines than real sentences, probably due to the lack of connecting words like of, and, by, in, etc.  I added punctuation creatively, plus the occasional connecting word.

More favorites
- Mitt didn't possess offshore holdings?
- It's November, President: think gay civil disobedience.
- Online learning company sells exotic batik fabrics.
- Pakistan drone fleet.
- Passenger trains running surpluses, Ayn.
- Registered here. kos elections director chuck berry.
- Right solution has reelected obama.
- Romney truly loves oddball projects involving search warrants.
- Rove's crossroads spent billions.
- Rush didn't send horace boothroyd iii a social reform law barring battery park johnny longtorso.
- Santorum lost badly;  wrong right leaders continue unabated.
- Shy person has offered advice tips to quilters.
- Stonewall democrats? Republican senate procedure.  Incredible lies and distortions.

The remaining top drawer items follow.  I have more; some make sense but are a little boring, and some don't make any sense.

- Applause lines drawn criticism.
- Code changes!  I've gotten caught napping.
- Commuter rail systems work!
- Diary linked article focuses founder markos community fundraiser.
- Energy transportation and space administration working group funded research firm and hired guns for gun battles.
- Here's david plouffe!
- I've helped create conditions such that Romney held desserts, cookies, and maudlin homemade bombs.
- Obama walked slowly, turning point:  bump.
- Oh crap thats obviously, thats life, blood red rocks, damascus explosion kills, people survive.
- On Sixth avenue, downtown los cabos, Christine Todd offered legitimate criticism of nobel prize winner paul tsongas.
- Parts manufacturers association served cocktailparty style points.
- President Goodluck Jonathan Bernstein.
- Rush transcript courtesy kos.
- Sen candidate Chris Mathews art paintings, featuring beautiful fine piece called romnesia.
- Strange things: right hates gays, lesbians.
- Trees. Urbanscapes. Dry weather channel radio talk stations. radio blue sky.  
- High risk factor: driving drunk, driving.
- There's gotta be a "run wild" card.
- Wiki article mentions tornado outbreak.
- Thing we've grown up with - angus king jr president
- Stayathome mom confirmed tornado emergency medicine man worth examining.
- Republicans think "hey folks, running ads attacking people set standards!"
- Weaken medicare, cut extended benefit guaranty, corporation alec agenda.
- Popular social inequity.
- Legislation extending unemployment statistics regarding haiti earthquake.
- Lawrence berkeley main threat reduction program cuts romney spokeswoman andrea grimes.
- Hearings held: captive audience.
- Friday diary rescue, daily bucket check.
- Carbon emissions trading post displaying austrian school lunches.

Anybody make it down this far?  Congratulations!  You get a bonus.

In 2012, almost nobody writes "same sex".  Nearly universally, this is written as one word: "samesex".  This surprised me!

If anybody else wants to play, let me know how many words you want, and I'll send you the raw text of a run with that many words.

That is all.  Have fun.
