Visualizing dataset of 2 million+ passwords:

Standard

I found a data-set of password(s) on DataScienceCentral: Password and hijacked email dataset for you to test your data science skills – And for fun, I played with the data-set for an hour or so:

1) Password Length vs Frequency

1 how to choose password password length

2) Percentage of passwords having at least one special character vs passwords having no special character:

2 passwords that have special character vs the one's that dont

3) Percentage of passwords that have: at-least one number, one alphabet & one special character AND length = 8 or more.

Answer: 1.4856%

Let’s see a comparison of Passwords of length 8 or more (69.302%) vs Passwords of length 8 or more having combination of alphabets & numbers & special characters (1.485%)

4 passwords having combination of alphabets plus numbers and special characters

That’s about it for now – it was fun!

 

And for those interested, here are the few behind the scene technical details:

Tools I used:

1. Excel & 2. SQL Server

Note: I first tried using Google refine to augment data – but it crashed on me. So thought of using SQL Server and TSQL. And if excel 2010 supported 2+ million then I would not have needed SQL server. Anyhow – the tool used is not important here.

Initial state:

2 million passwords in a .txt file.

Information I appended to the data-set using TSQL:

1. Length of password

2. Has Alphabets?

[a-zA-Z]

3. Has Numbers?

[0-9]

4. Has special Characters?

[^a-zA-Z0-9]

Plus few others derived from #2, #3 & #4 like ” has alphabets+ characters + special characters?”

That’s about it for the technical details. Ping me if interested!

 

Advertisements