Building Anonymous and Consistent Datasets

Posted on February 15, 2011

Dave Currie, Engineering Manager

Recently, Tuenti was approached with a proposal to do complex data analysis on some of our data. This presented a unique opportunity to get a different, less obvious view of the data that we handle and move around every day, so we were definitely interested. You can see the first results of this work at The Beauty of Social Networks. However, there were some issues that needed addressing before we committed to working on this.

Continental Spain X-Ray

Our main concern was how to maintain users’ privacy, and the reason is obvious. As a responsible company, we are extremely careful with our users’ data, almost to the point of paranoia. Even though we were working with a trusted partner under a contractual obligation of secrecy, we decided that before any data left our premises it needed to be made anonymous. We also considered how the data would be transported, to avoid any problems en route.

Generating anonymous data involves removing any information that allows a reader to link a profile back to an actual person. The most obvious step is to replace the user id with a different identifier, one that can’t be mapped back to the original. One way of doing this is to hash the id with a salt that is unique to that user, so that even if one account were successfully traced, the method would not reveal a way to trace the remaining accounts. You would probably want to throw in a secret salt too, just for good measure. A better way is simply to generate unique random numbers and create a mapping between the real ids and the new random ones.
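As a rough illustration of the two approaches described above, here is a minimal Python sketch. It is not the code we ran: the function names, the use of HMAC-SHA256 to combine the secret salt with the per-user salt, and the 64-bit size of the random identifiers are all assumptions made for the example.

```python
import hashlib
import hmac
import secrets


def hashed_pseudonym(user_id: int, user_salt: bytes, secret_salt: bytes) -> str:
    """Option 1 (hypothetical helper): keyed hash of the user id.

    The secret salt is used as the HMAC key and the per-user salt is
    prepended to the id, so the output depends on both.
    Assumes every per-user salt has the same fixed length.
    """
    message = user_salt + str(user_id).encode("ascii")
    return hmac.new(secret_salt, message, hashlib.sha256).hexdigest()


def random_mapping(user_ids):
    """Option 2 (hypothetical helper): map each real id to a unique random id.

    The returned dict is the sensitive mapping table; it has to be
    destroyed once the dataset has been generated.
    """
    mapping = {}
    used = set()
    for uid in user_ids:
        if uid in mapping:
            continue  # this id already has a pseudonym
        new_id = secrets.randbits(64)
        while new_id in used:  # retry on the (very unlikely) collision
            new_id = secrets.randbits(64)
        used.add(new_id)
        mapping[uid] = new_id
    return mapping
```

With the random mapping nothing in the published identifier is derived from the real id, so there is no hash to attack; the only link between the two is the mapping table itself.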

Even with the identifier problem solved, there are other, more subtle issues to be taken into account. One interesting factor we looked at involved the user relations map. To make advanced statistical analysis possible, the researchers needed to know when a relation was formed, in order to model growth of the map over time. However, timestamps in the data would allow potential attackers to pinpoint the approximate time when accounts were created, and then to map a given identifier to a small group of real users on the site.

We decided that this was not an acceptable risk to take, so we needed to find a way to avoid including time-related data. After discussing the matter with the researchers, we eventually settled on replacing timestamps with an order field, which would reveal the order in which relations were created but without binding them to a specific time. This solution was satisfactory to everyone involved, even though it meant more work for us when preparing the dataset.
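The substitution of timestamps with an order field can be sketched roughly as follows. This is only an illustration in Python: the function name and the tuple layout are assumptions, and the user ids are assumed to have been replaced by anonymous identifiers already.

```python
def timestamps_to_order(relations):
    """Replace creation timestamps with a plain order field (hypothetical helper).

    relations: iterable of (user_a, user_b, created_at) tuples.
    Returns a list of (user_a, user_b, order) tuples where order runs
    from 0 upwards in creation order. The timestamps themselves are dropped.
    Relations with equal timestamps keep their input order, because
    Python's sort is stable.
    """
    ordered = sorted(relations, key=lambda rel: rel[2])
    return [(a, b, rank) for rank, (a, b, _created_at) in enumerate(ordered)]
```

The output still lets a researcher replay the growth of the relations map step by step, but the gaps between consecutive steps no longer say anything about how much time passed.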

Finally, we had to actually get the data from our premises to the machines on which it would be analyzed. We employed fairly standard but rigorous security measures, including GPG public-key encryption and secure transport, either over encrypted network connections or on physical media. We actually used both of these options for different parts of the dataset.

Spain Full

These are just a few examples of the measures we take to keep data safe. Obviously, a key factor is cleaning up after yourself: only working on machines that aren’t likely to be compromised, making sure to destroy temporary datasets, getting rid of things like the secrets or mapping tables, and destroying our public and private keys after the data has been transported.

Our goal with this process was to ensure privacy even in the event that the generated dataset was somehow compromised and made public. It admittedly generated much more work for us, but in the end we think the effort was necessary to meet our own tight standards regarding security and privacy.

One Response to “Building Anonymous and Consistent Datasets”

  • Antonio Alcantara
    February 17, 2011 at 5:41 pm

    Please let me speak in Spanish:

    I just wanted to comment on the “hypocrisy” of saying that you value users’ privacy while secure browsing (over HTTPS) is not allowed on Tuenti.

    Why do you use secure connections to send the data to the teams in charge of producing the statistics, when the data had already been “anonymized”, and yet the connections with users are completely insecure, exposing session cookies, profile information and even instant messages (easy to capture on “public” wireless networks)?
    Is there any reason for this?

    I refer you to the last paragraph of the post:
    “Our goal with this process was to ensure privacy [...]. It generated much more work for us, but we think the effort was necessary to meet our standards regarding security and privacy.”

    Please consider it. I would like to be able to use Tuenti from anywhere without fear of being intercepted.

    Regards.
