Fingerprinting Over Time
As part of the plan to secure our user data, the pageview_hourly dataset needs to be sanitized so that it does not allow user browsing paths to be tracked. See this page for a broad view of the pageview_hourly sanitization project.
This page goes through a fingerprinting analysis over time, using the webrequest and pageview_hourly datasets.
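For intuition, a fingerprint here is just a hash over a handful of low-entropy request attributes; even without IPs or cookies, their combination can be nearly unique. A minimal Python sketch of the idea (illustrative only: Hive's built-in hash() is a different function, and the record values below are made up):

```python
import hashlib

# The same fields the Hive queries below combine into one hash.
FIELDS = ("project", "language_variant", "access_method", "zero_carrier",
          "agent_type", "continent", "country_code", "country",
          "subdivision", "city", "user_agent")

def fingerprint(record):
    """Join the field values with a separator and hash them.
    Records with identical field combinations collide; any
    differing field produces a different hash."""
    key = "\x1f".join(str(record.get(f, "")) for f in FIELDS)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

r1 = {"project": "en.wikipedia", "access_method": "desktop",
      "agent_type": "user", "country_code": "FR", "city": "Lyon",
      "user_agent": "Mozilla/5.0 ..."}
r2 = dict(r1, city="Paris")

assert fingerprint(r1) == fingerprint(dict(r1))  # same fields, same hash
assert fingerprint(r1) != fingerprint(r2)        # one field differs
```

If a given hash value is produced by requests from only one IP across many days, that hash effectively identifies one browser, which is the risk measured below.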
Data Preparation
First, compute fingerprint hashes for a single day of webrequest data, counting distinct IPs (dip), distinct page titles (dpv), and requests (c) per hash. The table is created in the pv_san database so the join in the next query can reference it:
CREATE TABLE pv_san.hashes1 STORED AS PARQUET AS
SELECT
hash(
pageview_info['project'],
pageview_info['language_variant'],
access_method,
x_analytics_map['zero'],
agent_type,
geocoded_data['continent'],
geocoded_data['country_code'],
geocoded_data['country'],
geocoded_data['subdivision'],
geocoded_data['city'],
user_agent_map) as h,
COUNT(DISTINCT ip) as dip,
COUNT(DISTINCT pageview_info['page_title']) AS dpv,
COUNT(1) as c FROM wmf.webrequest
WHERE webrequest_source = 'text'
AND year = 2016
AND month = 1
AND day = 11
AND agent_type = 'user'
GROUP BY
hash(
pageview_info['project'],
pageview_info['language_variant'],
access_method,
x_analytics_map['zero'],
agent_type,
geocoded_data['continent'],
geocoded_data['country_code'],
geocoded_data['country'],
geocoded_data['subdivision'],
geocoded_data['city'],
user_agent_map);
Then join those hashes against daily pageview_hourly hashes over the preceding months, to see how long each fingerprint survives:
WITH pv_hashes1 AS
(
SELECT
year, month, day,
hash(project, language_variant, access_method, zero_carrier, agent_type,
continent, country_code, country, subdivision, city, user_agent_map) as pvh,
SUM(view_count) as vc
FROM wmf.pageview_hourly
WHERE ((year = 2015 AND month > 5) OR (year = 2016 AND month = 1 AND day < 12))
AND agent_type = 'user'
GROUP BY
year,
month,
day,
hash(project, language_variant, access_method, zero_carrier, agent_type,
continent, country_code, country, subdivision, city, user_agent_map)
)
SELECT
year, month, day,
SUM(CASE WHEN dip = 1 THEN 1 ELSE 0 END) as dip1_n,
SUM(CASE WHEN dip = 2 THEN 1 ELSE 0 END) as dip2_n,
SUM(CASE WHEN dip = 3 THEN 1 ELSE 0 END) as dip3_n,
SUM(CASE WHEN dip = 4 THEN 1 ELSE 0 END) as dip4_n,
SUM(CASE WHEN dip = 5 THEN 1 ELSE 0 END) as dip5_n,
SUM(CASE WHEN dip > 5 THEN 1 ELSE 0 END) as dipPlus_n,
SUM(CASE WHEN dip = 1 THEN pvh1.vc ELSE 0 END) as dip1_vc,
SUM(CASE WHEN dip = 2 THEN pvh1.vc ELSE 0 END) as dip2_vc,
SUM(CASE WHEN dip = 3 THEN pvh1.vc ELSE 0 END) as dip3_vc,
SUM(CASE WHEN dip = 4 THEN pvh1.vc ELSE 0 END) as dip4_vc,
SUM(CASE WHEN dip = 5 THEN pvh1.vc ELSE 0 END) as dip5_vc,
SUM(CASE WHEN dip > 5 THEN pvh1.vc ELSE 0 END) as dipPlus_vc
FROM pv_hashes1 pvh1
INNER JOIN pv_san.hashes1 h1
ON (pvh1.pvh = h1.h)
GROUP BY
year, month, day
ORDER BY year, month, day
LIMIT 1000;
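The bucketing logic of the final SELECT can be checked in isolation: for each day, it counts how many matching hashes fall into each distinct-IP bucket (1 through 5, and more than 5) and how many pageviews those hashes carry. A small Python sketch of the same aggregation, on hypothetical sample rows:

```python
from collections import defaultdict

def bucket_matches(rows):
    """rows: (day, dip, view_count) tuples for hashes that matched
    the webrequest hash set. Returns {day: {bucket: (n_hashes, views)}}
    with buckets '1'..'5' and '5+', mirroring the dipN_n / dipN_vc
    columns of the Hive query above."""
    out = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for day, dip, vc in rows:
        bucket = str(dip) if dip <= 5 else "5+"
        out[day][bucket][0] += 1   # number of hashes in the bucket
        out[day][bucket][1] += vc  # pageviews carried by those hashes
    return {d: {b: tuple(v) for b, v in bs.items()} for d, bs in out.items()}

# Hypothetical sample: two single-IP hashes and one two-IP hash on day 11,
# one hash with more than five IPs on day 12.
sample = [(11, 1, 40), (11, 1, 7), (11, 2, 12), (12, 6, 300)]
result = bucket_matches(sample)
assert result[11]["1"] == (2, 47)
assert result[11]["2"] == (1, 12)
assert result[12]["5+"] == (1, 300)
```

The dip = 1 bucket is the one to watch: those are fingerprints seen from a single IP, where a match across days links one browser's activity over time.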
Results

The decrease in the number of hash matches over time is almost linear. Nonetheless, the number of hashes from December that still exist in June is very substantial (over a million distinct hashes), particularly hashes involving only one IP, the most dangerous bucket in terms of browsing-pattern reconstruction.