We can do password guessing way better than brute force by generating strings us...

gopher · on Oct 31, 2008

I created wordlists from Wikipedia database dumps some time ago (http://benjamin-schweizer.de/files/wordlist-wikipedia/); they are pretty large and thus useful for dictionary attacks. The wordlists are sorted, with the most common words at the top.

I think there is a typical password length, so you could improve the sorting with a multi-dimensional rating scheme. I'd use expected password length and the commonness of a word as factors. Mixing these real words with computer-generated words might speed up brute-force attacks.
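A minimal sketch of what such a rating scheme could look like, assuming the wordlist is already ordered from most to least common. The names `score` and `reorder`, the typical length of 8, and the weighting are all made up for illustration; they are not taken from the wordlists linked above.

```python
import math

def score(word, rank, typical_len=8, len_weight=1.0):
    """Combine commonness and length into one rating (higher is better).

    rank is the 0-based position in the frequency-sorted list.
    typical_len and len_weight are assumed values, not measured ones.
    """
    commonness = -math.log(rank + 1)  # 0 for the top word, falls slowly
    length_penalty = len_weight * abs(len(word) - typical_len)
    return commonness - length_penalty

def reorder(words, typical_len=8, len_weight=1.0):
    """Return the words sorted by descending score, ties kept in list order."""
    scored = [(score(w, i, typical_len, len_weight), i, w)
              for i, w in enumerate(words)]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [w for _, _, w in scored]

print(reorder(["the", "password", "of"]))
```

With these assumed weights, a word of the typical length overtakes shorter words that rank higher on frequency alone. Both parameters would need tuning against real data.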

However, I'm not sure how to integrate ordered wordfiles with rainbow tables. Any ideas?

jcl · on Oct 31, 2008

I don't think word frequency is a good estimator of how likely a word is to be a password. Many frequently used words -- like connectors or adverbs -- are unlikely to be used as passwords. I expect proper names (of people, places, or cultural works) are the most common passwords, and those are at a relative disadvantage in word frequency lists.

randomwalker · on Oct 31, 2008

rainbow tables are an implementation of the time-space tradeoff concept: you are trying to search through a space so large you cannot enumerate it. if you have already enumerated it, as in a wordlist, it is not meaningful to use rainbow tables. it's not a question of how; it's not even a well-defined operation.

that's great that you've made that list though. i wanted word frequency tables for my startup, which is an entirely unrelated type of project. if i hadn't found this i would have compiled it myself; thanks much :)

while your list doesn't have frequencies, i guess i can use the position in the list as a proxy for frequencies. but it's not optimal. any chance you can put up a list which also has the counts?
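One way to use list position as a frequency proxy is to assume a Zipf-like distribution, where the frequency of the word at rank r is roughly proportional to 1/r. This is only a sketch: the function name `zipf_weights` and the exponent of 1.0 are assumptions, and real counts would be better.

```python
def zipf_weights(words, exponent=1.0):
    """Estimate relative frequencies from rank alone.

    Assumes words is ordered from most to least common, contains no
    duplicates, and roughly follows a Zipf distribution (an assumption).
    """
    raw = [1.0 / (rank ** exponent) for rank in range(1, len(words) + 1)]
    total = sum(raw)
    return {w: r / total for w, r in zip(words, raw)}

weights = zipf_weights(["the", "of", "and"])
print(weights)
```

The estimates sum to 1 and depend only on rank, so they cannot recover how steep the real distribution is.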

gopher · on Oct 31, 2008

I don't have a current dump, but if you send me an email, I can give you the script I've written to create those dumps. It prints out the counts.
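For illustration only, here is a rough sketch of how word counts could be produced from plain text. It is not the script mentioned above; the names `count_words` and `format_counts` are hypothetical, and it assumes the dump has already been reduced to lines of text (parsing the actual Wikipedia dump format is not shown).

```python
import re
from collections import Counter

# Runs of letters only (Unicode-aware); digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+")

def count_words(lines):
    """Count lowercase word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(m.lower() for m in WORD_RE.findall(line))
    return counts

def format_counts(counts):
    """Return 'word<TAB>count' lines, most common first, ties alphabetical."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{word}\t{n}" for word, n in ordered]

counts = count_words(["The cat and the hat", "the end"])
print("\n".join(format_counts(counts)))
```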

linhir · on Oct 31, 2008

If you had a good training set of a million or so passwords that people actually used, you could apply machine learning techniques to speed this up, combined with the insights from natural language processing. Tragically, I doubt people would contribute their old passwords to a data set just to prove this.