This repo contains all the relevant sources related to our ACSAC 2018 paper: LoopMC: Using Loops For Malware Classification Resilient to Feature-unaware Perturbations.
The folder utils contains individual utilities related to our work.
We provide a standalone version of our semantic labeling technique, which returns a semantic label given a method name, class, and method signature.
How to use it:
cd utils
# ipython
In [1]: import semantic_labeling as sl
In [2]: sl.get_method_label_by_name('equals', 'Ljava/lang/Object;', 'equals(Ljava/lang/Object;)Z')
Out[2]: 'ObjectComparision'
You can import the semantic_labeling module into your projects and use it as illustrated above.
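As a rough sketch of how the module might be used from another project, the helper below labels a batch of methods in one call. The helper name label_methods and the tuple layout are illustrative assumptions and are not part of this repo; only get_method_label_by_name and its argument order (method name, class, signature) come from the example above. The labeling function is passed in as a parameter so the helper does not depend on where semantic_labeling lives.

```python
def label_methods(methods, labeler):
    """Return {(class_name, signature): label} for a batch of methods.

    methods: iterable of (method_name, class_name, signature) tuples
             (hypothetical layout chosen for this sketch).
    labeler: callable taking (method_name, class_name, signature), e.g.
             semantic_labeling.get_method_label_by_name.
    """
    labels = {}
    for method_name, class_name, signature in methods:
        labels[(class_name, signature)] = labeler(method_name, class_name, signature)
    return labels


# Example (run from the utils folder so semantic_labeling is importable):
#   import semantic_labeling as sl
#   methods = [('equals', 'Ljava/lang/Object;', 'equals(Ljava/lang/Object;)Z')]
#   label_methods(methods, sl.get_method_label_by_name)
```

Keying the result on class and signature rather than on the bare method name keeps overloaded methods apart.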
We created a VirtualBox VM of Ubuntu 16.04 that contains the required dependencies along with all the sources.
- Make sure you have VirtualBox installed on your system.
- Download the VM (in OVA format) from here.
- Import the downloaded OVA into VirtualBox: File -> Import Appliance -> specify the file path to the .ova file.
- You can log in using user: loopmc and password: loopmc
Note: We suggest giving the VM at least 4 GB of RAM; the more, the better.
There are four steps in our pipeline, which starts from a given set of benign and malicious APKs.
cd ~/loopmc/loopmc_code
source ~/virtualenvs/loopmc/bin/activate
The following steps should be run from within the virtual env.
This step disassembles the provided APKs, analyzes all the loops, and emits JSON files.
Generating JSONs:
cd ~/loopmc/loopmc_code
python analyze_batch.py -d <directory_containing_apks> -r <directory_where_the_jsons_should_be_stored>
Example:
python analyze_batch.py -d /tmp/apks -r /tmp/apkjsons
analyze_batch.py has other options, such as running in multi-process mode, which you are free to explore.
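If your APKs are split across several folders, the invocation above can be scripted. The following is a minimal sketch that uses only the -d and -r flags shown above; the helper names, the one-output-subfolder-per-input-folder layout, and the choice to pre-create the output folders are assumptions made for illustration. It is meant to be run from ~/loopmc/loopmc_code with the virtual env active.

```python
import os
import subprocess
import sys


def build_analyze_command(apk_dir, json_dir, script="analyze_batch.py"):
    """Build the argument list for one analyze_batch.py run (-d and -r only)."""
    # sys.executable is the interpreter of the active virtual env.
    return [sys.executable, script, "-d", apk_dir, "-r", json_dir]


def analyze_folders(apk_dirs, json_root, run=subprocess.check_call):
    """Run analyze_batch.py once per APK folder, with one JSON subfolder each."""
    for apk_dir in apk_dirs:
        # Name the output subfolder after the input folder (assumed layout).
        name = os.path.basename(os.path.normpath(apk_dir))
        json_dir = os.path.join(json_root, name)
        if not os.path.isdir(json_dir):
            os.makedirs(json_dir)
        run(build_analyze_command(apk_dir, json_dir))


# Example:
#   analyze_folders(["/tmp/apks/benign", "/tmp/apks/malware"], "/tmp/apkjsons")
```

The run parameter defaults to subprocess.check_call, so a failing analysis stops the loop instead of silently continuing.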
Dumping into a local database:
cd ~/loopmc/loopmc_code/ml_project/cellophane
python cellophane.py -j <directory_where_the_jsons_are_stored> --generate-db
The above command will create an sqlite DB in the file: /home/loopmc/loopmc/loopmc_code/ml_project/loopmc.db
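To check that the database was populated, you can inspect it with Python's built-in sqlite3 module. The sketch below makes no assumption about the schema that cellophane.py creates: it simply lists whatever tables exist along with their row counts. The function name list_tables is ours, not part of the repo.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: row_count} for every table in the SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        counts = {}
        for name in names:
            # Names come from sqlite_master; quote them as SQL identifiers.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                "SELECT COUNT(*) FROM " + quoted).fetchone()[0]
        return counts
    finally:
        conn.close()


# Example:
#   list_tables("/home/loopmc/loopmc/loopmc_code/ml_project/loopmc.db")
```

An empty result usually means the path is wrong, since sqlite3.connect creates a new empty file when the given path does not exist.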
This will create the auxiliary label files needed to create the feature vector.
cd ~/loopmc/loopmc_code/ml_project/feature_scripts
python get_all_labels.py <directory_where_the_jsons_are_stored>
python boolean_vectorize.py
cd ~/loopmc/loopmc_code/ml_project/clustering/sklearn
python fast_level_split.py --vector ../../feature_scripts/vector_data/vector.txt --idmap ../../feature_scripts/vector_data/loop_id_map.txt --available-labels ../../feature_scripts/features_data/available_labels.txt --label-json ../../feature_scripts/features_data/label_tree.json --outdir <directory_where_temp_clusters_should_be_stored>
This step will generate the feature vector, which can be used by any machine learning technique.
cd ~/loopmc/loopmc_code/ml_project/type_classification_scripts
python extract_type_genome.py <directory_where_the_jsons_are_stored> ../feature_scripts/vector_data/loop_id_map.txt <directory_where_temp_clusters_are_stored>/fast_leaf_level_split <output_directory>
Here, <directory_where_temp_clusters_are_stored> is the directory provided in the Generating Label files step.
The above script will generate a file at <output_directory>/type_clusters.csv that contains the feature vector for each APK.
NOTE: The script extract_type_genome.py creates the ground truth labels for APKs based on their names. If you want to change this, please change the function categorize_em_all in that file.
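The signature of categorize_em_all and the naming scheme it expects are not described here, so the following only illustrates what a name-based labeling rule can look like if you need to write your own. The function name label_from_name, the prefix table, and the "unknown" fallback are all hypothetical and do not reflect the actual code in extract_type_genome.py.

```python
import os

# Hypothetical prefix-to-label table; adjust it to your own naming scheme.
PREFIX_LABELS = (
    ("benign_", "benign"),
    ("mal_", "malicious"),
)


def label_from_name(apk_path, prefix_labels=PREFIX_LABELS):
    """Derive a ground-truth label from the APK file name (illustrative only)."""
    name = os.path.basename(apk_path).lower()
    for prefix, label in prefix_labels:
        if name.startswith(prefix):
            return label
    # No known prefix matched; leave the decision to the caller.
    return "unknown"
```

Returning an explicit "unknown" makes it easy to spot and exclude APKs whose names do not follow the expected scheme before training.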
Once the feature vector is created, you can follow the steps below to run Random Forest:
cd ~/loopmc/loopmc_code/ml_project/type_classification_scripts
python print_RF_type_results.py <output_directory>/type_clusters.csv <directory_where_temp_clusters_are_stored>/fast_leaf_level_split
Here, <output_directory>/type_clusters.csv is the file generated in the previous step and <directory_where_temp_clusters_are_stored> is the folder provided in the Generating Label files step.
The results will be stored in the file score_RF.txt, and the folder RF_trees contains the individual trees and the estimator.