This post gives a general overview of how to responsibly derive anonymized trip data from an MDS Provider feed, so that the data is suitable for publication on an open data portal while protecting rider privacy. It uses the approach taken in Louisville, Kentucky as an example.
Overview
The Stae data importer, along with our pre-built vendor integrations, makes it very easy to pull MDS Trip and Status Change data into your Stae account. However, these raw data are not appropriate for publication on an open data portal, because they pose a risk to rider privacy. MDS data contain a latitude and longitude for each scooter, which can make it possible to identify a rider if a trip is particularly unique. To address this, some cities use a method called k-anonymization to protect rider privacy while still retaining useful information about trip patterns. The algorithm enforces a minimum number of trips (k) that must start and/or end at the same exact locations. If fewer trips share a location, the algorithm "fuzzes" the data by randomizing the latitude and longitude within a set radius of the original location. You can read more about this method and its application to MDS here: https://medium.com/sharedstreets/aggregating-trip-data-using-k-anonymization-727d5a6413f3
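As a toy illustration of the counting step described above, the sketch below groups a handful of made-up trips by their start and end locations and flags the groups that are too small to publish as-is. The sample coordinates, the threshold of 2, and the helper names countByKey and needsFuzzing are all hypothetical and chosen only for this example. The comparison is written as "less than or equal to k" to match the script later in this article.

```javascript
// Illustrative threshold only; not a recommendation.
const K = 2;

// Made-up trips whose coordinates are already rounded to three decimals.
const sampleTrips = [
  { start: '38.256,-85.751', end: '38.250,-85.760' },
  { start: '38.256,-85.751', end: '38.250,-85.760' },
  { start: '38.256,-85.751', end: '38.250,-85.760' },
  { start: '38.221,-85.701', end: '38.230,-85.742' } // a unique trip
];

// Build a lookup key from a trip's start and end locations.
const locationKey = (trip) => `${trip.start}|${trip.end}`;

// Count how many trips share each start/end pair.
const countByKey = (trips) => {
  const counts = {};
  for (const trip of trips) {
    const key = locationKey(trip);
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
};

// For each trip, report whether its group is small enough to need randomizing.
const needsFuzzing = (trips, k) => {
  const counts = countByKey(trips);
  return trips.map((trip) => counts[locationKey(trip)] <= k);
};

console.log(needsFuzzing(sampleTrips, K)); // [ false, false, false, true ]
```

Only the last trip is flagged, because it is the only one whose start/end pair occurs K times or fewer.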
There are a few different implementations of k-anonymization that have been published by both cities and private companies. The Innovation Department in Louisville, Kentucky provides an interesting deep dive into their point of view and the methodology they use for data anonymization in their repository (https://github.com/louisvillemetro-innovation/dockless-open-data). Our example workflow and code below use their implementation of the k-anonymization algorithm.
Data Import
If you have not already imported your MDS Trip data into the platform, start by creating a new Source, selecting the correct vendor, and filling in your credentials. Once the data is in the system, retrieve the source id and source key from the source export page; you will need both to download the data into the processing script.
Data Processing
The following script downloads the MDS Trip data directly from one or more sources, applies the anonymization algorithm, and outputs a JSON file of the processed data, which can be uploaded to a new source within your Stae account. For the script to work, you must supply at least one source id and source key at the bottom of the script.
We recommend processing at least six months' worth of trip data if at all possible. The script provides a set of parameters that default to the values used in Louisville, but they can be edited to meet the specific needs of your jurisdiction.
These include:
K_ANON_VAL
The minimum number of trips allowed at any origin/destination before location data are randomized.
FUZZED_TIME_BUCKET
Rounds the start and end times to the nearest interval (in minutes) specified here, to prevent matching trips to the exact times individuals may have departed or arrived.
FUZZED_LOCATION_PRECISION
Rounds the GPS coordinates to this number of decimal places to obscure the exact location of the trip.
RANDOM_LATITUDE_RADIUS / RANDOM_LONGITUDE_RADIUS
If k-anonymization is required for a specific trip, this is the radius (in degrees) within which the random location is generated.
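To get a feel for what these defaults mean in practice, the following sketch applies them to a few made-up values. It is not part of the Louisville implementation: the helper names roundCoordinate, roundTime, latitudeRadiusMeters and longitudeRadiusMeters are hypothetical, the time rounding assumes the bucket divides 60 evenly, and the conversion from degrees to meters uses a rough figure of 111,320 meters per degree, which is an approximation.

```javascript
// Default values listed above; change them to see the effect.
const FUZZED_TIME_BUCKET = 15;         // minutes
const FUZZED_LOCATION_PRECISION = 3;   // decimal places
const RANDOM_LATITUDE_RADIUS = 0.004;  // degrees
const RANDOM_LONGITUDE_RADIUS = 0.005; // degrees

// Rough length of one degree of latitude (assumption, spherical Earth).
const METERS_PER_DEGREE = 111320;

// Round a coordinate to the configured precision (toFixed returns a string).
const roundCoordinate = (value) => value.toFixed(FUZZED_LOCATION_PRECISION);

// Round a time of day to the nearest bucket, carrying into the next hour
// (and wrapping past midnight) when the minutes round up to 60.
const roundTime = (hour, minute) => {
  const rounded =
    Math.floor((minute + FUZZED_TIME_BUCKET / 2) / FUZZED_TIME_BUCKET) * FUZZED_TIME_BUCKET;
  const total = hour * 60 + rounded;
  const h = Math.floor(total / 60) % 24;
  const m = total % 60;
  return `${String(h).padStart(2, '0')}:${String(m).padStart(2, '0')}`;
};

// Approximate size of the random radius on the ground.
const latitudeRadiusMeters = () => RANDOM_LATITUDE_RADIUS * METERS_PER_DEGREE;
const longitudeRadiusMeters = (latitude) =>
  RANDOM_LONGITUDE_RADIUS * METERS_PER_DEGREE * Math.cos((latitude * Math.PI) / 180);

console.log(roundCoordinate(38.25612));               // "38.256"
console.log(roundTime(14, 7), roundTime(14, 8));      // "14:00" "14:15"
console.log(roundTime(23, 55));                       // "00:00"
console.log(Math.round(latitudeRadiusMeters()));      // about 445
console.log(Math.round(longitudeRadiusMeters(38.25))); // about 437 at Louisville's latitude
```

Under these assumptions the two radii come out at roughly the same distance on the ground near Louisville, even though the longitude value is larger in degrees. A jurisdiction at a very different latitude may want to revisit the longitude radius for that reason.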
const moment = require('moment');
const stae = require('stae');
const _ = require('lodash');
const turf = require('@turf/turf');
const fs = require('fs');
const UUID = require("pure-uuid");

// The minimum number of trips allowed at any origin/destination before location data is randomized.
const K_ANON_VAL = 4;

// Rounds the start and end times to the nearest interval specified here to prevent matching trips to exact times individuals may have departed or arrived.
const FUZZED_TIME_BUCKET = 15;

// Rounds the GPS coordinates to this precision to obscure exact location of the trip.
const FUZZED_LOCATION_PRECISION = 3;

// If k-anonymization is required for a specific trip this is the radius in which the random location is generated.
const RANDOM_LATITUDE_RADIUS = 0.004;
const RANDOM_LONGITUDE_RADIUS = 0.005;

// Fetch one page of data from a source, retrying on failure.
const retrieve = async (api, sourceId, limit, offset, filters) => {
  let response
  try {
    response = await api.source.datum.find({
      sourceId: sourceId,
      options: {
        offset: offset,
        limit: limit,
        filters: filters
      }
    })
  } catch(e) {
    console.log(e)
    console.log("Retrying...")
    response = await retrieve(api, sourceId, limit, offset, filters)
  }
  return response
}

// Page through every trip in a source and convert each one to a fuzzed trip.
const retrieveTrips = async (sourceId, key, startedAt, endedAt) => {
  const api = stae.createClient({ key: key })
  const limit = 5000;
  let filters = {};
  if (startedAt) {
    filters['data.startedAt'] = { $gt: startedAt }
  }
  if (endedAt) {
    filters['data.endedAt'] = { $lt: endedAt }
  }
  let offset = 0;
  let data = [];
  while(true) {
    console.log(`Source : ${sourceId}`)
    let response = await retrieve(api, sourceId, limit, offset, filters)
    console.log(`Results: ${response.meta.results}`)
    if (response.meta.results === 0) {
      break
    }
    offset = offset + response.meta.results;
    data.push(response.results.map(trip => buildFuzzedTrip(trip)))
  }
  data = _.flatten(data);
  return data
}

const generateTripUUID = (trip) => {
  const uuid = new UUID(5, "ns:URL", trip.data.id);
  return uuid
}

// Round the first and last points of the trip path to the configured precision.
const generateFuzzedLocations = (trip) => {
  const coordinates = trip.data.path.coordinates;
  const startLocation = coordinates[0].map(c => c.toFixed(FUZZED_LOCATION_PRECISION))
  const endLocation = coordinates[coordinates.length - 1].map(c => c.toFixed(FUZZED_LOCATION_PRECISION))
  return { start: startLocation, end: endLocation }
}

// Round a time to the nearest bucket, carrying into the next hour when needed.
const fuzzTimes = (minute, hour) => {
  const fuzzedMinute = (((minute + (FUZZED_TIME_BUCKET / 2)) / FUZZED_TIME_BUCKET | 0) * FUZZED_TIME_BUCKET) % 60;
  const fuzzedHour = ((((minute / 105) + .5) | 0) + hour) % 24;
  return [fuzzedHour, fuzzedMinute]
}

// Sum the distance in miles between consecutive points on the trip path.
const calculateDistance = (trip) => {
  return trip.data.path.coordinates.reduce((totalForTrip, point, i, points) => {
    let nextPoint = points[i + 1];
    let pointDistance = 0;
    if (nextPoint) {
      pointDistance = turf.distance(turf.point(point), turf.point(nextPoint), { units: 'miles' });
    }
    return totalForTrip + pointDistance
  }, 0)
}

const generateFuzzedDatetimes = (trip) => {
  const startDate = moment(trip.data.startedAt);
  const endDate = moment(trip.data.endedAt);
  const fuzzedStart = fuzzTimes(startDate.minute(), startDate.hour());
  const fuzzedEnd = fuzzTimes(endDate.minute(), endDate.hour());
  const startTime = moment(startDate).minute(fuzzedStart[1]).hour(fuzzedStart[0]);
  const endTime = moment(endDate).minute(fuzzedEnd[1]).hour(fuzzedEnd[0]);
  const tripDuration = (startDate - endDate);
  return {
    startDate: startDate.format('Y-MM-DD'),
    endDate: endDate.format('Y-MM-DD'),
    startTime: startTime.format('HH:mm'),
    endTime: endTime.format('HH:mm'),
    tripDuration: parseInt(Math.abs(tripDuration / 1000 / 60)),
    dayOfWeek: startDate.format('dddd'),
    hourNum: startDate.format('H')
  }
}

const generateFuzzedTripLocationKey = (fuzzedTrip) => {
  return `${fuzzedTrip.startLatitude},${fuzzedTrip.startLongitude},${fuzzedTrip.endLatitude},${fuzzedTrip.endLongitude}`
}

// Move the start and end points by a random offset within the configured radius.
const generateRandomLocations = (fuzzedTrip) => {
  const randomNumberA = Math.random()
  const randomNumberB = Math.random()
  const randomStartLongitude = parseFloat(fuzzedTrip.startLongitude) + (randomNumberB * RANDOM_LONGITUDE_RADIUS * Math.sin( 2 * Math.PI * randomNumberA / randomNumberB ));
  const randomEndLongitude = parseFloat(fuzzedTrip.endLongitude) + (randomNumberB * RANDOM_LONGITUDE_RADIUS * Math.sin( 2 * Math.PI * randomNumberA / randomNumberB ));
  const randomStartLatitude = parseFloat(fuzzedTrip.startLatitude) + (randomNumberB * RANDOM_LATITUDE_RADIUS * Math.cos( 2 * Math.PI * randomNumberA / randomNumberB ));
  const randomEndLatitude = parseFloat(fuzzedTrip.endLatitude) + (randomNumberB * RANDOM_LATITUDE_RADIUS * Math.cos( 2 * Math.PI * randomNumberA / randomNumberB ));
  return {
    startLongitude: randomStartLongitude.toFixed(FUZZED_LOCATION_PRECISION),
    startLatitude: randomStartLatitude.toFixed(FUZZED_LOCATION_PRECISION),
    endLatitude: randomEndLatitude.toFixed(FUZZED_LOCATION_PRECISION),
    endLongitude: randomEndLongitude.toFixed(FUZZED_LOCATION_PRECISION)
  }
}

const buildFuzzedTrip = (trip) => {
  const uuid = generateTripUUID(trip);
  const locations = generateFuzzedLocations(trip);
  const datetimes = generateFuzzedDatetimes(trip);
  const distance = calculateDistance(trip);
  const fuzzedTrip = {
    uuid: uuid,
    startLatitude: locations.start[1],
    startLongitude: locations.start[0],
    endLatitude: locations.end[1],
    endLongitude: locations.end[0],
    startDate: datetimes.startDate,
    endDate: datetimes.endDate,
    startTime: datetimes.startTime,
    endTime: datetimes.endTime,
    tripDuration: datetimes.tripDuration,
    tripDistance: distance,
    dayOfWeek: datetimes.dayOfWeek,
    hourNum: datetimes.hourNum
  }
  return fuzzedTrip
}

(async (tripSources) => {
  const startedAt = null;
  const endedAt = null;
  const tripCounts = {};

  let fuzzedTrips = await Promise.all(tripSources.map(async (source) => {
    return retrieveTrips(source.id, source.key, startedAt, endedAt)
  }));
  fuzzedTrips = _.flatten(fuzzedTrips);

  // Count how many trips share each rounded start/end location pair.
  fuzzedTrips.forEach((fuzzedTrip) => {
    const tripLocationKey = generateFuzzedTripLocationKey(fuzzedTrip)
    if (!tripCounts[tripLocationKey]) {
      tripCounts[tripLocationKey] = 0
    }
    tripCounts[tripLocationKey]++
    return fuzzedTrip
  })

  // Randomize the locations of trips whose location pair is too rare.
  fuzzedTrips = fuzzedTrips.map(fuzzedTrip => {
    const tripLocationKey = generateFuzzedTripLocationKey(fuzzedTrip)
    const tripCount = tripCounts[tripLocationKey]
    if (tripCount <= K_ANON_VAL) {
      const randomLocations = generateRandomLocations(fuzzedTrip)
      fuzzedTrip.startLongitude = randomLocations.startLongitude;
      fuzzedTrip.startLatitude = randomLocations.startLatitude;
      fuzzedTrip.endLongitude = randomLocations.endLongitude;
      fuzzedTrip.endLatitude = randomLocations.endLatitude;
    }
    return fuzzedTrip
  })

  try {
    fs.truncateSync('./fuzzed-trips.json')
  } catch {}
  fs.writeFileSync('./fuzzed-trips.json', JSON.stringify(fuzzedTrips));
})([
  // Provide multiple sources if you have multiple vendors.
  {
    id: 'SOURCE_ID', // replace with your SourceID
    key: 'SOURCE_KEY' // replace with your SourceKey
  }
])
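Before uploading the output, it can be worth a quick sanity check that every record has the fields the script is meant to produce. The following is a minimal sketch of such a check, not part of the original workflow. The field names are taken from the object built by buildFuzzedTrip in the script above, the file name matches the one the script writes, and the helper name findIncompleteTrips is hypothetical.

```javascript
const fs = require('fs');

// Field names taken from the object built by buildFuzzedTrip.
const EXPECTED_FIELDS = [
  'uuid', 'startLatitude', 'startLongitude', 'endLatitude', 'endLongitude',
  'startDate', 'endDate', 'startTime', 'endTime',
  'tripDuration', 'tripDistance', 'dayOfWeek', 'hourNum'
];

// Return the indexes of records where any expected field is missing or null.
const findIncompleteTrips = (trips) =>
  trips.reduce((bad, trip, index) => {
    const missing = EXPECTED_FIELDS.some(
      (field) => trip[field] === undefined || trip[field] === null
    );
    return missing ? bad.concat(index) : bad;
  }, []);

// Same file name the processing script writes to.
const OUTPUT_PATH = './fuzzed-trips.json';

// Only run the check when the output file is actually present.
if (fs.existsSync(OUTPUT_PATH)) {
  const trips = JSON.parse(fs.readFileSync(OUTPUT_PATH, 'utf8'));
  const incomplete = findIncompleteTrips(trips);
  console.log(`${trips.length} trips read, ${incomplete.length} incomplete`);
}
```

This only checks that the fields are present. It does not verify that the anonymization itself behaved as intended, so it is worth reviewing a sample of the output by hand as well.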
Wrap Up
Once this anonymized data is imported back into your Stae account, you can set the source to public to publish it directly to your Open Data Hub, or use the source export URL to automate uploading it to your existing Open Data Portal.
We have provided a data type (https://municipal.systems/types/custom-fuzzed-trip) which matches the output of this script to make it easy to create a new source for this data, but you can also create your own custom type to include any additional data you would like.
If you would like to discuss this approach, feel free to get in touch. If you have any questions about this specific article, feel free to leave feedback in the comments below.