A deep learning and NLP analysis of a large corpus of plaintext passwords.
Objectives:
- Train generative models.
- Understand how people change their passwords over time: hello123 -> h@llo123 -> h@llo!23.
Disclaimer: for research purposes only.
In the media:
- 1.4 Billion Clear Text Credentials Discovered in a Single Database
- Collection of 1.4 Billion Plain-Text Leaked Passwords Found Circulating Online
- Archive of 1.4 BEEELLION credentials in plain text found in dark web archive
- Forbes
Get the data:
- Download any torrent client.
- Here is a magnet link you can find on Reddit: magnet:?xt=urn:btih:7ffbcd8cee06aba2ce656168cf68ce2addca0a3&dn=BreachCompilation&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fglotorrents.pw%3A6969&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337
- A list of checksums is available here: checklist.chk
- Running ./count_total.sh on BreachCompilation should display something like 1400553870 lines:
./count_total.sh BreachCompilation
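count_total.sh itself is not reproduced here; as a sanity check, its behavior can be approximated in Python (function name and logic are my sketch, not the repo's script) by counting every line of every file under the dataset root:

```python
import os

def count_total(root: str) -> int:
    # Walk the whole dataset tree and sum the line counts of every file.
    # On the full BreachCompilation dump this should land near 1400553870.
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as f:
                total += sum(1 for _ in f)
    return total
```

This reads files in binary mode so malformed encodings in the dump do not abort the count.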
Get started (processing + deep learning)
Process the data and run the first deep learning model:
# make sure to install the python deps first. A virtual env is recommended here.
# virtualenv -p python3 venv3; source venv3/bin/activate; pip install -r requirements.txt
# Remove "--max_num_files 100" to process the whole dataset (a few hours and 50GB of free disk space are required).
./process_and_train.sh <BreachCompilation path>
Data (explanation)
INPUT: BreachCompilation/
BreachCompilation is organized as:
- a/ - folder of emails starting with a
- a/a - file of emails starting with aa
- a/b
- a/d
- ...
- z/
- ...
- z/y
- z/z
OUTPUT: - BreachCompilationAnalysis/edit-distance/1.csv
- BreachCompilationAnalysis/edit-distance/2.csv
- BreachCompilationAnalysis/edit-distance/3.csv
[...]
> cat 1.csv
1 ||| samsung94 ||| s@msung94
1 ||| 040384alexej ||| 040384alexey
1 ||| HoiHalloDoeii14 ||| hoiHalloDoeii14
1 ||| hoiHalloDoeii14 ||| hoiHalloDoeii13
1 ||| hoiHalloDoeii13 ||| HoiHalloDoeii13
1 ||| 8znachnuu ||| 7znachnuu
EXPLANATION: edit-distance/ contains the password pairs sorted by edit distance.
1.csv contains all pairs with edit distance = 1 (exactly one addition, substitution or deletion).
2.csv => edit distance = 2, and so on.
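The metric behind these files is the classic Levenshtein distance. A self-contained sketch (function name is mine; the repo may compute it differently):

```python
def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance: minimum number of single-character additions,
    # substitutions or deletions needed to turn a into b, via the standard
    # dynamic-programming recurrence over two rolling rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # addition
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

Every pair in 1.csv above scores exactly 1 under this function, e.g. 040384alexej vs 040384alexey (one substitution).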
- BreachCompilationAnalysis/reduce-passwords-on-similar-emails/99_per_user.json
- BreachCompilationAnalysis/reduce-passwords-on-similar-emails/9j_per_user.json
- BreachCompilationAnalysis/reduce-passwords-on-similar-emails/9a_per_user.json
[...]
> cat 96_per_user.json
{
"1.0": [
{
"edit_distance": [
0,
1
],
"email": "[email protected]",
"password": [
"090698d",
"090698D"
]
},
{
"edit_distance": [
0,
1
],
"email": "[email protected]",
"password": [
"5555555555q",
"5555555555Q"
]
}
]
}
EXPLANATION: reduce-passwords-on-similar-emails/ contains files sorted by the first 2 characters of
the email address. For example, an email starting with 96 will be located in 96_per_user.json.
Each file lists all the passwords grouped by user and by edit distance.
For example, the first user above had 2 passwords: 090698d and 090698D. The edit distance between them is 1.
The edit_distance and password arrays have the same length, hence the leading 0 in the edit_distance array (the distance from the first password to itself).
Those files are useful to model how users change passwords over time.
We can't recover which one was the first password, but a shortest Hamiltonian path algorithm is run
to detect the most probable password ordering for a user. For example:
hello => hello1 => h@llo1 => h@llo11 is the shortest path.
We assume that users are lazy by nature and prefer to change their password by as few characters
as possible.
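For the handful of passwords a typical user has, the shortest Hamiltonian path can be found by brute force. A sketch under that assumption (function names are mine; the repo's implementation may differ):

```python
from itertools import permutations

def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance, the same metric used by the edit-distance/ CSVs.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def most_likely_history(passwords):
    # Try every ordering of the user's passwords and keep the one whose
    # total neighbour-to-neighbour edit distance is smallest: the laziest
    # sequence of changes. Factorial cost, fine for small password sets.
    def cost(order):
        return sum(edit_distance(x, y) for x, y in zip(order, order[1:]))
    return min(permutations(passwords), key=cost)
```

Note the path can be read in either direction; without timestamps the data alone cannot tell which end of the path was the first password.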
Run the data processing on its own:
python3 run_data_processing.py --breach_compilation_folder <BreachCompilation path> --output_folder ~/BreachCompilationAnalysis
If the dataset is too large for you, set --max_num_files to a value between 0 and 2000 to process
only a subset of the files.
- Make sure you have enough free memory (8GB should be enough).
- It takes around 1h30 on an Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz (single thread).
- The uncompressed output is around 45GB.