View Diary: Data Storage and the NSA (121 comments)

  •  Couple of things (6+ / 0-)

    From a quick Google search, the average phone call is 3:15 minutes.  Let's call it 4 minutes.  Use something like Speex to compress it and it's less than 5 MB/call if I did the math right.  
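To check the per-call figure, here is a rough back-of-the-envelope sketch. The 8 kbps bitrate is an assumption picked as a plausible narrowband speech setting, not a number from the comment, and `call_size_bytes` is a made-up helper name.

```python
def call_size_bytes(duration_s: float, bitrate_bps: float) -> float:
    """Size of one compressed call: bits per second times seconds, divided by 8."""
    return duration_s * bitrate_bps / 8

# Assumed figures: a 4-minute call and an illustrative 8 kbps speech bitrate.
size = call_size_bytes(4 * 60, 8_000)
print(f"{size / 1_000_000:.2f} MB per call")
```

Under those assumptions a call comes out well under the 5 MB ceiling, so the estimate above is on the generous side.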

    Secondly, for storage, you have to remember that contractors built this, not IT geeks.  I am guessing that they are using EMC-like SAN systems.  These are pretty spendy, and the contractor markup will make them that much more so.  If designed by IT geeks, it would probably look like the Backblaze pod, probably using ZFS to manage the file system.  Connect them to a 10 Gb iSCSI network with a few database servers at the front end for network/pattern analysis that various spooks run searches against.  

    Having two data centers probably means that the data is mirrored between them for redundancy.  There probably is some super secret fiber running between them.  

    My guess is that the NSA is mostly interested in getting metadata and encrypted communications.  The metadata is useful for determining human relationships by mapping who calls whom.  The number you call every Mother's Day is probably your mother.  You will call your partner often.  Information on the towers used by a call will tell where you are and can provide all sorts of other information.  
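A toy sketch of that kind of relationship mapping, just to show how little it takes. The records, names, and helper functions are all invented for illustration; the two dates are US Mother's Day in 2012 and 2013.

```python
from collections import Counter

# Hypothetical call records: (caller, callee, ISO date). All values are made up.
records = [
    ("alice", "bob", "2012-05-13"),
    ("alice", "bob", "2013-05-12"),
    ("alice", "carol", "2013-05-01"),
    ("alice", "carol", "2013-05-02"),
    ("alice", "carol", "2013-05-03"),
]

def contact_counts(records, person):
    """How often `person` called each number: frequent contacts float to the top."""
    return Counter(callee for caller, callee, _ in records if caller == person)

def called_on_every_date(records, person, dates):
    """Numbers that `person` called on every one of the given dates."""
    per_date = [
        {callee for caller, callee, day in records if caller == person and day == date}
        for date in dates
    ]
    return set.intersection(*per_date) if per_date else set()

print(contact_counts(records, "alice").most_common(1))
print(called_on_every_date(records, "alice", ["2012-05-13", "2013-05-12"]))
```

In this made-up data, carol shows up as the most frequent contact (the likely partner) and bob as the number called on both Mother's Days.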

    Most calls are probably recorded as metadata only, and only a comparative few (most likely originating from numbers associated with us DFHs) get the full STASI treatment.  The metadata is small and easy to transfer, probably at most 2 KB/day for most people.  
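To get a feel for how far 2 KB goes, here is a sketch that packs one call into a fixed-width binary record. The field layout is purely an assumption for illustration and not any real call-detail-record format.

```python
import struct

# Hypothetical record layout, little-endian with no padding:
# caller number and callee number (uint64 each), start time (uint32 epoch
# seconds), duration in seconds (uint16), cell tower id (uint32).
RECORD_FORMAT = "<QQIHI"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)

def pack_call(caller, callee, start_epoch, duration_s, tower_id):
    """Serialize one call's metadata into RECORD_SIZE bytes."""
    return struct.pack(RECORD_FORMAT, caller, callee, start_epoch, duration_s, tower_id)

record = pack_call(15555550100, 15555550199, 1372734141, 195, 40213)
calls_per_2kb = 2048 // RECORD_SIZE
print(RECORD_SIZE, "bytes per call;", calls_per_2kb, "calls fit in 2 KB")
```

With this assumed layout each call takes 26 bytes, so 2 KB holds 78 calls, which is more than most people make in a day.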

    I am also guessing that this database is used mostly to work backwards, because it's an awful lot of data to analyze.  For instance, there is a suicide bombing.  The NSA gets the IMEI (basically the serial number) of every phone nearby and then starts to figure out who owns each one, including the ones that were destroyed by the explosion.  From there, they could figure out who the suicide bomber called and start providing information to whatever Three Letter Agencies it's working with.  
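The working-backwards query could be sketched roughly like this. The sightings, call list, IMEI strings, and function names are all hypothetical; a real system would presumably run this against a database rather than Python lists.

```python
from datetime import datetime, timedelta

# Hypothetical tower sightings: (imei, tower, timestamp). All values are made up.
sightings = [
    ("imei-001", "tower-A", datetime(2013, 7, 1, 9, 0)),
    ("imei-001", "tower-A", datetime(2013, 7, 1, 11, 55)),
    ("imei-002", "tower-A", datetime(2013, 7, 1, 11, 58)),
    ("imei-003", "tower-B", datetime(2013, 7, 1, 11, 59)),
]
# Hypothetical call log: (caller imei, callee imei).
calls = [("imei-002", "imei-900"), ("imei-002", "imei-901"), ("imei-001", "imei-902")]

def phones_near(sightings, tower, when, window):
    """IMEIs seen on `tower` during the `window` leading up to `when`."""
    return {imei for imei, t, ts in sightings
            if t == tower and when - window <= ts <= when}

def contacts_of(calls, imei):
    """Every phone that `imei` called."""
    return {callee for caller, callee in calls if caller == imei}

event = datetime(2013, 7, 1, 12, 0)
nearby = phones_near(sightings, "tower-A", event, timedelta(minutes=10))
print(sorted(nearby))
print(sorted(contacts_of(calls, "imei-002")))
```

Start from the event, pull the phones on the nearby tower, then fan out through each phone's call history. Nothing has to be analyzed until there is something to look for.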

    It almost sounds like it's much more useful against nonviolent protesters than suicide bombers.  

    I'm a 4 Freedoms Democrat.

    by DavidMS on Mon Jul 01, 2013 at 08:02:21 PM PDT

    •  All your points are probably more accurate (3+ / 0-)
      Recommended by:
      ColoTim, StrayCat, Cliss

      than what I wrote here. This diary is about explaining how it could be done in terms non-technical people can understand. You are totally correct on the phone data. My examples for texts, tweets and email are also very high, 40% to 200% higher than reality. I used MP3 as my example because people can relate to MP3. I overinflated the call length from 4 minutes to an hour to show that 1 billion calls a day is, data-wise, really small in the grand scheme of things. MP3 has a 10:1 compression and most audio compression is at 26:1 these days. My point still remains: if we think disk space (old-school term) / data storage (new term) is the limiting factor, we'd be wrong.

      I was looking at this EMC storage system and instinctively knew this is what is used, but I didn't include it in this diary. It is too complex to explain and I don't know how much it costs.

      The metadata yields a lot of great information that can be used in predictive modeling algorithms. Again, it just didn't fit in this diary that is long enough already.

      My husband started to explain how he codes databases and tables, far more than I could absorb, and I quickly realized that for my purposes here, I couldn't do the subject justice in a few paragraphs.

      I decided to focus on how operationally the data collection could occur in this diary. Maybe in the future I can write something about how the software would work for indexing and accessing the data.

      If a nation expects to be ignorant and free, in a state of civilization, it expects what never has and never will be. Thomas Jefferson

      by JDWolverton on Mon Jul 01, 2013 at 08:28:11 PM PDT

      [ Parent ]

      •  However voice compression is something like (1+ / 0-)

        4 kbps (about half a kilobyte per second), even with less efficient free and patent-unencumbered codecs.  Want to bet that something like Audible's codec can get it down to 3 kbps or even 2 kbps?

        You have watched Faux News, now lose 2d10 SAN.

        by Throw The Bums Out on Tue Jul 02, 2013 at 05:32:10 AM PDT

        [ Parent ]

        •  It's been 20+ years since I worked with audio data (0+ / 0-)

          I chose to use files that people can easily relate to. The last sound engineer I talked to was so excited about how he was using 26:1 compression without losing audio integrity. I just barely understand lossy and lossless audio data compression, but I'm sure you're right. This area is progressing along with the rest of computerization technology at a very fast pace.

          If a nation expects to be ignorant and free, in a state of civilization, it expects what never has and never will be. Thomas Jefferson

          by JDWolverton on Tue Jul 02, 2013 at 06:41:07 AM PDT

          [ Parent ]

    •  However keep in mind that Speex is not (1+ / 0-)

      quite as good due to patent issues.  Something like Audible's would reduce the data size even further.

      You have watched Faux News, now lose 2d10 SAN.

      by Throw The Bums Out on Tue Jul 02, 2013 at 05:30:53 AM PDT

      [ Parent ]
