
View Diary: Data Storage and the NSA (121 comments)

  •  Machines read them (6+ / 0-)

    Machines are quite capable of reading words, deriving grammar, and determining the semantic meaning of sentences, paragraphs, or the e-mail or document as a whole.

    This goes back almost 25 years now, but a company I worked for back then had a whole team of linguists and speech experts who devised algorithms to break down "written" data, whether structured (from known fields on a form, for example) or unstructured (free form, like these comment blocks or diaries), and derive slang, grammar, roots, and semantic meaning. This was done to automate translation, to be more "intelligent" in understanding search requests, and the like.

    The other thing to keep in mind is that despite the number of words in the English language, for example, we commonly use only a small fraction of them. This is why you see very good compression results when ZIPing text. Even though there are an order of magnitude more proper nouns than true words in English, those proper nouns tend to trend: the names of businesses, locations, and people being discussed often follow current events, as just one instance. If we know even one other piece of information, such as the sender or recipient or their respective locations, the universe of likely proper nouns becomes smaller.
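To make the compression point concrete, here is a minimal sketch using Python's standard zlib module (the same DEFLATE family that ZIP uses). The sample sentences are made up for illustration, and repeating them exaggerates the effect compared with real-world text, so treat the numbers as a demonstration of the principle, not a benchmark.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


# Hypothetical sample: ordinary English built from a small, common vocabulary.
# Repeating it 50 times overstates how well real text compresses.
english = (
    "I have to go to the store to get milk. "
    "Do you want me to get anything for you while I am at the store? "
    "I will be back in about an hour, and then we can go to the game. "
) * 50

# Random bytes of the same length serve as an incompressible baseline.
random_bytes = os.urandom(len(english))

text_ratio = compression_ratio(english.encode("utf-8"))
noise_ratio = compression_ratio(random_bytes)

print(f"English text: {text_ratio:.2%} of original size")
print(f"Random bytes: {noise_ratio:.2%} of original size")
```

The text shrinks dramatically because the same few words keep recurring, while the random bytes do not shrink at all.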

    Dialects, colloquialisms, and slang are known and can be codified for meaning (i.e., translated as if you had used "proper" English instead). So again, if you know the writer's identity or location, you can improve the quality of understanding.

    Of course, the same can be done with voice processing. The differences between the way people speak and the way they write are known, though voice processing is much more complex due to accents and impediments, for example, or even more likely, regional or local "speak".

    Nothing is perfect, so while most communications can be processed automatically with little misunderstanding, there is likely a significant number of communications that would be kicked to cryptanalysis to try to determine if a code is being used. This process would still be highly automated: it would look not only at meanings but also at relationships to known events (past, present, and future) as well as other attributes of those events. It may look at movie or TV references, for example: are you talking about an episode of "24" or something else?

    Naturally, if you have any known identities, the communication can be combined with other information, such as shopper card information or credit/debit purchase information, to help identify whether you are using a code. For example, if you were to frequently say, "I have to go to the store to get milk" in your communications, that might be a code despite being a very common everyday thing. When was the last time you purchased milk? What quantity did you buy, etc.?

    All of this can happen in automated systems before it gets near a human analyst, if ever.
