I found a data-set of passwords on DataScienceCentral: "Password and hijacked email dataset for you to test your data science skills". For fun, I played with the data-set for an hour or so and looked at the following:
1) Password Length vs Frequency
2) Percentage of passwords having at least one special character vs passwords having no special character:
3) Percentage of passwords that have at least one number, one alphabet & one special character AND a length of 8 or more:
Let’s compare passwords of length 8 or more (69.302%) vs passwords of length 8 or more that also combine alphabets, numbers & special characters (1.485%).
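To make the arithmetic behind these two percentages concrete, here is a minimal Python sketch. It is not the method used for the numbers above (those came from SQL Server and Excel), and the tiny sample list is made up purely for illustration. It also assumes that "alphabet" means an ASCII letter, "number" means an ASCII digit, and "special character" means anything else.

```python
import string

# Assumption: letters and digits are ASCII only; everything else is "special".
LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)


def is_long(password: str) -> bool:
    """True when the password has 8 or more characters."""
    return len(password) >= 8


def is_mixed(password: str) -> bool:
    """True when the password has at least one letter, one digit and one special character."""
    has_letter = any(ch in LETTERS for ch in password)
    has_digit = any(ch in DIGITS for ch in password)
    has_special = any(ch not in LETTERS and ch not in DIGITS for ch in password)
    return has_letter and has_digit and has_special


def summarize(passwords: list[str]) -> dict[str, float]:
    """Return both shares as percentages of the total number of passwords."""
    total = len(passwords)
    if total == 0:
        return {"pct_long": 0.0, "pct_long_and_mixed": 0.0}
    long_count = sum(1 for pw in passwords if is_long(pw))
    strong_count = sum(1 for pw in passwords if is_long(pw) and is_mixed(pw))
    return {
        "pct_long": round(100.0 * long_count / total, 3),
        "pct_long_and_mixed": round(100.0 * strong_count / total, 3),
    }


if __name__ == "__main__":
    # Hypothetical sample, not taken from the real data-set.
    sample = ["123456", "password", "letmein1", "P@ssw0rd!", "abc"]
    print(summarize(sample))
```

Both percentages share the same denominator (all passwords), which is what makes the 69.302% vs 1.485% comparison meaningful.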
That’s about it for now – it was fun!
And for those interested, here are a few behind-the-scenes technical details:
Tools I used:
1. Excel
2. SQL Server
Note: I first tried using Google Refine to augment the data – but it crashed on me. So I decided to use SQL Server and TSQL instead. If Excel 2010 supported 2+ million rows (a worksheet tops out at 1,048,576), I would not have needed SQL Server at all. Anyhow – the tool used is not important here.
The data: 2 million passwords in a .txt file.
Information I appended to the data-set using TSQL:
1. Length of password
2. Has Alphabets?
3. Has Numbers?
4. Has special Characters?
Plus a few others derived from #2, #3 & #4, like "has alphabets + numbers + special characters?"
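For readers who would rather not set up SQL Server, the appended columns above can be sketched in Python. This is an illustration and not the TSQL that was actually run: the names `PasswordFeatures` and `describe` are hypothetical, and the definition of "special character" (anything that is not an ASCII letter or digit) is an assumption.

```python
import string
from dataclasses import dataclass

# Assumption: "alphabets" are ASCII letters and "numbers" are ASCII digits.
LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)


@dataclass
class PasswordFeatures:
    """One row of the augmented data-set (hypothetical column names)."""
    password: str
    length: int
    has_alphabets: bool
    has_numbers: bool
    has_special: bool
    has_all_three: bool  # derived from the three flags above


def describe(password: str) -> PasswordFeatures:
    """Compute the appended columns for a single password."""
    has_alphabets = any(ch in LETTERS for ch in password)
    has_numbers = any(ch in DIGITS for ch in password)
    # Assumption: a special character is anything that is neither letter nor digit.
    has_special = any(ch not in LETTERS and ch not in DIGITS for ch in password)
    return PasswordFeatures(
        password=password,
        length=len(password),
        has_alphabets=has_alphabets,
        has_numbers=has_numbers,
        has_special=has_special,
        has_all_three=has_alphabets and has_numbers and has_special,
    )


if __name__ == "__main__":
    # Made-up examples, not rows from the real file.
    for pw in ["123456", "password1", "P@ssw0rd!"]:
        print(describe(pw))
```

To process the real .txt file, one would call `describe` on each line after stripping the trailing newline; at 2 million rows this should fit comfortably in memory on a typical machine, though I have not timed it.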
That’s about it for the technical details. Ping me if interested!