I want to point out that the whole topic of "metadata" is one of the biggest fig leaves the NSA is hiding behind in this whole "discussion" we are having. I was reminded of it again reading the excellent diary about the NSA on PBS.
The NSA hack lawyer says:
We do collect metadata, information about communications, more broadly than we collect the actual content of communications, but that's because it is less intrusive than collecting content and in fact can provide us information that helps us more narrowly focus our collection of content on appropriate foreign intelligence targets.
This is nothing but a bold attempt to take the absolute truth of the Snowden disclosure and pretend it's no big deal. They've seized on the term "metadata" as if it's some innocuous part of the information chain that isn't the really "good stuff" we should worry about. "Sure we collect plain old metadata, but who really cares about metadata?" has been the mantra from the moment the news came out. How many times have you heard anyone speak on this issue without saying "metadata" over and over? Remember the first press conference, with Dianne Feinstein and Saxby Chambliss stating, "As you know, this is just metadata. There is no content involved"?
This is all one giant misdirection, and many of us fall for it. I've read and heard countless comments from people willing to accept "metadata" collection as long as it doesn't include the actual content of communication. Well, here's a bit of news ... metadata is nothing more than what the designer of the system says it is, and it can and does include everything, including the full content of the communication.
How do I know this? Because I've worked most of my career selling software that provides various capabilities to process and analyze "unstructured data". That phrase refers to any information that, unlike a spreadsheet or a database, doesn't exist in an easily accessible and organized format: things like Word documents, PowerPoint slides, emails, or the machine-generated transcript of a voice recording. The technologies I've worked with all, in one way or another, "normalize" this data to make it understandable for machine processing by doing things like:
- extract the text from any file format
- voice recognize and convert speech to text
- eliminate noise by cutting out "stop words" like "the"
- analyze frequency, position and syntax to "weight" important terms
- provide "sentiment analysis" to understand the tone of a message from its
- extract meaning from text by reading all the words
- mathematically analyze the meaning of every word in a document
- locate every document that is "similar" to a model document based on meaning
- isolate "entities" from unstructured data, like name, place, address or any kind of fact/noun
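To make a few of the steps above concrete, here is a toy sketch in Python. It is my own illustration, not the code of any product or agency system: the stop-word list, the function names, the capitalized-word "entity" rule and the record layout are all assumptions made up for this example. It strips stop words, weights terms by simple frequency, pulls out crude entities, and then files all of it, body text included, in a record that a system designer could choose to label "metadata".

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "at", "is", "for", "with"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def weighted_terms(text, top_n=3):
    """Drop stop words, then rank the remaining terms by raw frequency."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(tokens).most_common(top_n)


def extract_entities(text):
    """Very crude entity spotting: runs of two or more capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+\b", text)


def build_metadata(sender, recipient, body):
    """Assemble a record; what counts as 'metadata' is the designer's choice."""
    return {
        "from": sender,
        "to": recipient,
        "top_terms": weighted_terms(body),
        "entities": extract_entities(body),
        "body_text": body,  # nothing stops the full content from living here
    }


if __name__ == "__main__":
    message = (
        "Please meet Alice Johnson at the Grand Hotel. "
        "Bring the report to the hotel, and the second report too."
    )
    record = build_metadata("a@example.com", "b@example.com", message)
    for field, value in record.items():
        print(field, "->", value)
```

Real products use far more sophisticated statistics and language models than this, but the point of the sketch is the last function: every field in that record is derived from, or simply is, the content of the message.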
Lots of technologies like this exist, and every one I have been associated with has been sold directly to one or more of the "letter agencies" (NSA, CIA, FBI, GCHQ, MOD, etc.) or indirectly via a Booz Allen or SAIC or similar intermediary.
ALL of those capabilities, and some I don't know about, are being applied to the NSA data collection. These are enormous haystacks of data, and they need an incredible array of tools to sift out the needles they may be looking for. Everything that any of these programs generates is considered "metadata". This includes the body text of the message. This is so simple that it amazes me the NSA defenders are hiding behind it, but I think they were unable to deny the truth, so they went with the argument that what they are collecting doesn't matter. And many of the people arguing about this point have bought into the lie pretty well.