Analysis: using deep learning models to analyze a large corpus of clear-text passwords.

Author: olanna. Date: 2020-03-08

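Using deep learning and NLP to analyze a large corpus of clear-text passwords.

Objectives:

Disclaimer: for research purposes only.

In the media

Get the data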

./count_total.sh BreachCompilation

Get started (processing + deep learning)

Process the data and run the first deep learning model:

# make sure to install the python deps first. Virtual env are recommended here.
# virtualenv -p python3 venv3; source venv3/bin/activate; pip install -r requirements.txt
# Remove "--max_num_files 100" to process the whole dataset (few hours and 50GB of free disk space are required.)
./process_and_train.sh <BreachCompilation path>

Data (explanation)

INPUT: BreachCompilation/

BreachCompilation is organized as:

- a/   - folder of emails starting with a
- a/a  - file of emails starting with aa
- a/b
- a/d
- ...
- z/
- ...
- z/y
- z/z

OUTPUT:

- BreachCompilationAnalysis/edit-distance/1.csv
- BreachCompilationAnalysis/edit-distance/2.csv
- BreachCompilationAnalysis/edit-distance/3.csv
[...]

> cat 1.csv
1 ||| samsung94 ||| [email protected]
1 ||| 040384alexej ||| 040384alexey
1 ||| HoiHalloDoeii14 ||| hoiHalloDoeii14
1 ||| hoiHalloDoeii14 ||| hoiHalloDoeii13
1 ||| hoiHalloDoeii13 ||| HoiHalloDoeii13
1 ||| 8znachnuu ||| 7znachnuu

EXPLANATION: edit-distance/ contains the password pairs sorted by edit distance. 1.csv contains all pairs with edit distance = 1 (exactly one addition, substitution or deletion). 2.csv => edit distance = 2, and so on.

- BreachCompilationAnalysis/reduce-passwords-on-similar-emails/99_per_user.json
- BreachCompilationAnalysis/reduce-passwords-on-similar-emails/9j_per_user.json
- BreachCompilationAnalysis/reduce-passwords-on-similar-emails/9a_per_user.json
[...]

> cat 96_per_user.json
{
    "1.0": [
        {
            "edit_distance": [
                0,
                1
            ],
            "email": "[email protected]",
            "password": [
                "090698d",
                "090698D"
            ]
        },
        {
            "edit_distance": [
                0,
                1
            ],
            "email": "[email protected]",
            "password": [
                "5555555555q",
                "5555555555Q"
            ]
        }

EXPLANATION: reduce-passwords-on-similar-emails/ contains files sorted by the first 2 letters of the email address. For example, [email protected] will be located in 96_per_user.json.

Each file lists all the passwords grouped by user and by edit distance. For example, [email protected] had 2 passwords: 090698d and 090698D. The edit distance between them is 1. The edit_distance and password arrays have the same length, which is why the edit_distance array starts with a 0.

Those files are useful for modeling how users change passwords over time. We can't recover which password came first, but a shortest Hamiltonian path algorithm is run to find the most probable password ordering for each user. For example: hello => hello1 => [email protected] => [email protected] is the shortest path. We assume that users are lazy by nature and prefer to change as few characters as possible when they change their password.
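To make the two ideas above concrete, here is a minimal Python sketch of an edit distance and of a shortest Hamiltonian path ordering over one user's passwords. It is not the project's actual code: the function names edit_distance and order_passwords are made up for illustration, the distance is assumed to be plain Levenshtein distance, and the ordering is found by brute force over all permutations, which is only practical for the handful of passwords a single user typically has.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def order_passwords(passwords):
    """Hypothetical helper: try every ordering and keep the one whose
    consecutive edit distances sum to the smallest total (a brute-force
    shortest Hamiltonian path). Returns (ordering, distances), where
    distances starts with 0 so both lists have the same length."""
    if not passwords:
        return [], []
    best_order, best_steps = None, None
    for order in permutations(passwords):
        steps = [0] + [edit_distance(x, y) for x, y in zip(order, order[1:])]
        if best_steps is None or sum(steps) < sum(best_steps):
            best_order, best_steps = list(order), steps
    return best_order, best_steps


if __name__ == "__main__":
    print(edit_distance("090698d", "090698D"))
    print(order_passwords(["hello1", "hello", "hello12"]))
```

Note that a path and its reverse have the same total length, so this kind of ordering can only suggest a sequence of changes, not which end of the sequence came first. That matches the caveat above about not being able to recover the first password.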

Run the data processing alone:

python3 run_data_processing.py --breach_compilation_folder <BreachCompilation path> --output_folder ~/BreachCompilationAnalysis

If the dataset is too large for you, you can set max_num_files to a value between 0 and 2000.