CS510 Project 3

Building Line-level Defect Detection Models

In this project, you are expected to learn how to build a defect prediction model for software source code from scratch. You are required to apply deep-learning techniques, e.g., classification, tokenization, embedding, etc., to build more accurate prediction models with the dataset provided.

Background
Line-level Defect classifiers predict which lines in a file are likely to be buggy.

A typical line-level defect prediction using deep-learning consists of the following steps:

Data extraction and labeling: Mining buggy and clean lines from a large dataset of software changes (usually GitHub).
Tokenization and pre-processing: Deep learning algorithms take a vector as input. Since source code is text, it needs to be tokenized and transformed into a vector before being fed to the model.
Model Building: Using the tokenized data and labels to train a deep learning classifier. Many different classifiers have been shown to work for text input (RNNs and CNNs). Most of these models can be built using TensorFlow.
Defect Detection: Unlabelled instances (i.e., lines of code or files) are fed to the trained model, which classifies them as buggy or clean.
Evaluation Metrics
Metrics such as Precision, Recall, and F1 are widely used to measure the performance of defect prediction models. Here is a brief introduction:

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]

\[ F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]

These metrics rely on four main numbers: true positives, false positives, true negatives, and false negatives. True positives are predicted-defective instances that are truly defective, while false positives are predicted-defective instances that are actually not defective. True negatives are predicted non-defective instances that are actually non-defective, while false negatives are predicted non-defective instances that are actually defective. F1 is the harmonic mean of precision and recall.

These metrics are threshold-dependent and are not the best way to evaluate binary classifiers on their own. In this project, we will also use the Receiver Operating Characteristic curve (ROC curve) and its associated metric, Area Under the ROC Curve (AUC), to evaluate our trained models independently of any threshold. The ROC curve is created by plotting the true positive rate (i.e., recall, see definition above) against the false positive rate at various threshold settings.

\[ \text{False Positive Rate} = \frac{FP}{FP + TN} \tag{4} \]
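For illustration only (not part of the provided scripts), these metrics can be computed with scikit-learn; the label and prediction vectors below are made-up toy values:

# Toy example: computing the metrics above with scikit-learn
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground-truth labels (1 = buggy)
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]     # classify with a 50% threshold

print('Precision:', precision_score(y_true, y_pred))
print('Recall:', recall_score(y_true, y_pred))
print('F1:', f1_score(y_true, y_pred))
print('AUC:', roc_auc_score(y_true, y_prob))        # threshold-independent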

(I)- Using TensorFlow to build a simple classification model
Part I will guide you through building a simple bidirectional LSTM model, while parts II and III will let you explore different ways to improve it.

CS Linux Servers have the environment ready to use. The following instructions assume using one of these machines unless stated otherwise.

(We’ve tested on mc18.cs.purdue.edu and cuda.cs.purdue.edu. Other mc machines may or may not work)

The environment uses Python 3 and virtualenv. For more information on how to use virtualenv, please look at the virtualenv documentation. To activate the provided environment, run:

source /homes/cs510/project-3/venv/bin/activate

If you work on your own machine, after you have created and activated your virtualenv, you can install the required libraries using the requirements.txt file we provided:

pip install --upgrade pip
pip install -r requirements.txt

0.1 Load the Input Data

Since the dataset is quite large (9 GB uncompressed), we put it in the /homes/cs510/project-3/data folder on the servers.

If you prefer to work on your own machine, you will need to download the data and update the path in tokenization.py accordingly.

The training, validation, and test data are provided as pickled Pandas dataframes, in train.pickle, valid.pickle, and test.pickle respectively.

Each dataframe consists of 4 columns:

instance: the line under test
context_before: the context right before the line under test. In this question, the context before consists of all the lines in the function before the tested line.
context_after: the context right after the line under test. In this question, the context after consists of all the lines in the function after the tested line.
is_buggy: the label of the tested line. 0 means the line is not buggy, 1 means the line is buggy.
The first step is to load the data and tokenize it. To load the data, use the following code (modify the paths if necessary):

# Load the data:
import pickle

with open('data/train.pickle', 'rb') as handle:
    train = pickle.load(handle)

with open('data/valid.pickle', 'rb') as handle:
    valid = pickle.load(handle)

with open('data/test.pickle', 'rb') as handle:
    test = pickle.load(handle)
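As an optional sanity check (not part of the provided scripts), you can inspect the loaded dataframes before tokenizing; the column names are the ones listed above:

# Optional sanity check on the loaded dataframes
print(train.shape, valid.shape, test.shape)   # number of rows and columns per split
print(train.columns.tolist())                 # instance, context_before, context_after, is_buggy
print(train.iloc[0]['instance'])              # the first line under test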

The custom tokenizer implemented in tokenization.py is a basic Java tokenizer from the javalang library, enhanced to also abstract string literals and numbers other than 0 and 1.

# Tokenize and shape our input:
def custom_tokenize(string):
    try:
        tokens = list(javalang.tokenizer.tokenize(string))
    except:
        return []
    values = []
    for token in tokens:
        # Abstract strings
        if '"' in token.value or "'" in token.value:
            values.append('$STRING$')
        # Abstract numbers (except 0 and 1)
        elif token.value.isdigit() and int(token.value) > 1:
            values.append('$NUMBER$')
        # Otherwise: keep the token value
        else:
            values.append(token.value)
    return values

def tokenize_df(df):
    df['instance'] = df['instance'].apply(lambda x: custom_tokenize(x))
    df['context_before'] = df['context_before'].apply(lambda x: custom_tokenize(x))
    df['context_after'] = df['context_after'].apply(lambda x: custom_tokenize(x))
    return df

test = tokenize_df(test)
train = tokenize_df(train)
valid = tokenize_df(valid)

with open('data/tokenizedtrain.pickle', 'wb') as handle:
    pickle.dump(train, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('data/tokenizedvalid.pickle', 'wb') as handle:
    pickle.dump(valid, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('data/tokenizedtest.pickle', 'wb') as handle:
    pickle.dump(test, handle, protocol=pickle.HIGHEST_PROTOCOL)

Loading the data and tokenizing it can be done by running the script:

python tokenization.py

The tokenized dataset will be saved in the data folder under proj-skeleton (not the data folder under /homes/cs510/project-3). You can change this if necessary. The tokenization should take about 80 minutes.

0.2 Preprocessing data
Once we have the tokenized data, we need to transform them into vectors before feeding them to the deep learning model.

This part can be done by running the script:

python preprocess.py

It performs the transformation and saves the transformed data (x_train.pickle, etc.) under the data folder.

For this question, we represent each instance as one vector of tokens: tokenized context before, <START>, tokenized line under test, <END>, tokenized context after.

The tokens <START> and <END> indicate where the line under test starts and ends.

For this question, we will only keep 50,000 training instances to save time. You can try to use a larger dataset (1 million instances or more) in parts II-IV.

Loading tokenized data and reshaping the input:

# Loading tokenized data
with open('data/tokenizedtrain.pickle', 'rb') as handle:
    train = pickle.load(handle)

with open('data/tokenizedvalid.pickle', 'rb') as handle:
    valid = pickle.load(handle)

with open('data/tokenizedtest.pickle', 'rb') as handle:
    test = pickle.load(handle)

# Reshape instances:
def reshape_instances(df):
    df["input"] = df["context_before"].apply(lambda x: " ".join(x)) + " <START> " \
        + df["instance"].apply(lambda x: " ".join(x)) + " <END> " \
        + df["context_after"].apply(lambda x: " ".join(x))
    X_df = []
    Y_df = []
    for index, rows in df.iterrows():
        X_df.append(rows.input)
        Y_df.append(rows.is_buggy)
    return X_df, Y_df

X_train, Y_train = reshape_instances(train)
X_test, Y_test = reshape_instances(test)
X_valid, Y_valid = reshape_instances(valid)

X_train = X_train[:50000]
Y_train = Y_train[:50000]
X_test = X_test[:25000]
Y_test = Y_test[:25000]
X_valid = X_valid[:25000]
Y_valid = Y_valid[:25000]

Since the deep learning model takes a fixed-length vector of numbers as input, we use the training set to build a vocabulary that maps each token to a number. Then we encode our training, testing, and validation instances and create fixed-length vectors representing the encoded instances. We limit the size of an instance to 1,000 tokens. In parts II-IV, you might want to experiment with different vector sizes.

# Build vocabulary and encoder from the training instances
maxlen = 1000
vocabulary_set = set()
for data in X_train:
    vocabulary_set.update(data.split())

vocab_size = len(vocabulary_set)
print(vocab_size)

# Encode training, valid and test instances
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

def encode(text):
    encoded_text = encoder.encode(text)
    return encoded_text

X_train = list(map(lambda x: encode(x), X_train))
X_test = list(map(lambda x: encode(x), X_test))
X_valid = list(map(lambda x: encode(x), X_valid))

X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)
X_valid = pad_sequences(X_valid, maxlen=maxlen)

0.3 Training the model
Training and evaluation of the model is done by train_and_test.py.

For our first model, we will train a two-layer bidirectional RNN using LSTM layers. RNNs are known to work well with text data. A tutorial showing how to create a basic RNN model with TensorFlow is available online.

Our model is defined as follows:

# Model Definition
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

model.summary()

Since the data is fairly large, we might not be able to fit the embedding of the entire dataset in memory. Therefore, we need a batch generator that produces the input batches for the model on the fly.

# Building generators
class CustomGenerator(Sequence):

    def __init__(self, text, labels, batch_size, num_steps=None):
        self.text, self.labels = text, labels
        self.batch_size = batch_size
        self.len = np.ceil(len(self.text) / float(self.batch_size)).astype(np.int64)
        if num_steps:
            self.len = min(num_steps, self.len)

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        batch_x = self.text[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y

train_gen = CustomGenerator(X_train, Y_train, batch_size)
valid_gen = CustomGenerator(X_valid, Y_valid, batch_size)
test_gen = CustomGenerator(X_test, Y_test, batch_size)

We feed these data generators to the model and start training as shown below:

# Training the model
checkpointer = ModelCheckpoint('data/models/model-{epoch:02d}-{val_loss:.5f}.hdf5',
                               monitor='val_loss', verbose=1,
                               save_best_only=True, mode='min')

callback_list = [checkpointer]

his1 = model.fit_generator(generator=train_gen, epochs=1,
                           validation_data=valid_gen, callbacks=callback_list)

0.4 Evaluating the model
Once the model is trained, we evaluate it on the test set. predict_generator outputs, for each instance, the predicted probability that it is buggy.

Traditionally, instances are then classified as class 0 (i.e., clean) if the probability is lower than 50%, and as class 1 (i.e., buggy) if the probability is higher. However, the 50% threshold might not be the best choice, and a different threshold might provide better results. Therefore, to take the impact of the threshold into account, we draw the ROC curve and use the AUC (area under the curve) metric to measure the correctness of our classifier.

predIdxs = model.predict_generator(test_gen, verbose=1)

fpr, tpr, _ = roc_curve(Y_test, predIdxs)
roc_auc = auc(fpr, tpr)

plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw,
         label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.savefig('auc_model.png')

For part (I), please include auc_model.png in your report, and measure the buggy rate (i.e., the percentage of instances labeled 1) in the training, validation, and test sets.
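A minimal sketch for measuring the buggy rate, assuming the Y_train, Y_valid, and Y_test label lists built during preprocessing:

# Buggy rate = percentage of instances labeled 1 in each split
def buggy_rate(labels):
    return 100.0 * sum(labels) / len(labels)

print('Train buggy rate: %.2f%%' % buggy_rate(Y_train))
print('Valid buggy rate: %.2f%%' % buggy_rate(Y_valid))
print('Test buggy rate: %.2f%%' % buggy_rate(Y_test))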

(II)- Improving the results by using a better deep-learning algorithm
The model trained in part (I) is simple and does not perform very well. In the past few years, many different models for classifying text inputs for diverse tasks (content tagging, sentiment analysis, translation, etc.) have been proposed in the literature. In part (II), you will look at the literature and apply a different deep-learning algorithm to do defect prediction. You can, and are encouraged to, use or adapt models that other people have proposed for other tasks. Please cite your source and provide a link to a paper and/or GitHub repository showing that this algorithm has been applied successfully to text classification, modeling, or generation tasks.

Examples of models to try:

Hierarchical Attention Networks for Document Classification

Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN

Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems

You can also look at more complex models like BERT, ELMo, or XLNet.

You can search GitHub for text-classification models and pick the one you like!

We strongly recommend that you do not implement the CNN-only models from this paper. We have extensively tested that model for our specific task and already know it does not work well.
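As one concrete illustration of the attention-based direction above (a hedged sketch only, not the model from any of the listed papers; the layer sizes, the input length of 1,000 tokens, and the attention-pooling formulation are all assumptions), you could add a simple attention-pooling head on top of the bidirectional LSTM from part I:

# Hedged sketch: Bi-LSTM with a simple attention-pooling head
import tensorflow as tf

inputs = tf.keras.Input(shape=(1000,))                        # padded token ids (maxlen=1000)
x = tf.keras.layers.Embedding(encoder.vocab_size, 64)(inputs)
x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True))(x)   # one vector per token
scores = tf.keras.layers.Dense(1, activation='tanh')(x)       # unnormalized attention scores
weights = tf.keras.layers.Softmax(axis=1)(scores)             # attention weights over time steps
context = tf.keras.layers.Dot(axes=1)([weights, x])           # weighted sum of token vectors
context = tf.keras.layers.Flatten()(context)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(context)

model = tf.keras.Model(inputs, outputs)
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

Compared to part I, the only change is that the per-token LSTM outputs are combined by a learned weighted average instead of relying on the final LSTM state alone.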

Report: For this question, please put in your report the model you chose, a link to the paper and/or GitHub repository where you got the model, a short discussion of why you chose this model, your source code, and an evaluation of your trained model on the test set (AUC and ROC curve). If you get any improvement compared to the model used in part I, please report it too.

If the model you pick is too complex and takes too long to train on the entire training set, explain how much time the model would take to train on the entire dataset and train your model on a sample of the dataset instead.

(III)- Other ways to improve the results
In this question, you will try to improve the model you worked with in part II using different methods. Choose at least two of the methods below and try to improve the results you got in part II. Report which methods you used and their impact on the results and training time.

Use more training data: In part I, we only used 50,000 instances to train our model. You can try to train your model on the entire training set instead. Based on our experience, using 1 million instances produces a much higher AUC than using 50,000 instances. Generally, the more training instances, the higher the AUC, until it saturates. The constraint is machine time.

Data cleaning: The input data we provided is automatically extracted from GitHub and likely contains a lot of noise. To improve the results, one possibility is to clean the datasets. You can investigate the raw data a bit more and try to clean the input data. Examples (non-exhaustive) of challenges to investigate and solve are listed below; a small pandas sketch for the first two checks follows this list.

Duplicate instances: Are there any instances that are labeled both buggy and clean?
Length of the input: What is the average length of an instance? Are there any outliers? Does removing outliers improve the results?
Quality of the input: Comments have not been removed from the inputs. Does removing comments help improve the results?
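A hedged sketch for the first two checks; it assumes the raw (untokenized) training dataframe and the column names instance and is_buggy:

# Hedged sketch: conflicting labels and length outliers in the raw training data
# (assumes `train` is the dataframe loaded from train.pickle, before tokenization)

# 1) Instances that appear with both labels (buggy and clean)
labels_per_instance = train.groupby('instance')['is_buggy'].nunique()
print('Instances with conflicting labels:', (labels_per_instance > 1).sum())

# 2) Length of each instance (in characters; token counts would work as well)
lengths = train['instance'].str.len()
print(lengths.describe())                                  # mean, quartiles, max
print('Above the 99th length percentile:', (lengths > lengths.quantile(0.99)).sum())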
Tokenization and input abstraction:

In this project, we use a simple tokenization based on a Java tokenizer and a basic abstraction of strings and numbers. This has the inconvenience of creating a gigantic vocabulary that might be difficult for a deep learning network to learn. Many different tokenizers or abstractions can be tried:

Source code contains structured information that could help abstract the data to reduce the vocabulary size. For example, all variables could be abstracted to the same token variable, all method calls to the token methodcall, types to type, etc. You can also distinguish between different variables in the same instance by abstracting different variables with slightly different tokens (e.g., var1, var2, etc.). Such information can be extracted from an AST or a Java parser (the javalang library contains a basic AST parser that could be used). Using such an abstraction will significantly reduce the vocabulary and might help the algorithm to learn. A minimal sketch of this idea follows this list.
Subword tokenizers have been used in NLP. You can try tokenizers like SentencePiece or WordPiece.
You can also build your own tokenizer.
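A hedged sketch of identifier abstraction using the token types that javalang already exposes. Note the assumption that javalang.tokenizer.Identifier covers variable, method, and type names alike, so a real AST-based version would be more precise:

# Hedged sketch: abstract identifiers to var1, var2, ... with javalang token types
import javalang

def abstract_tokenize(string):
    try:
        tokens = list(javalang.tokenizer.tokenize(string))
    except:
        return []
    mapping = {}                       # identifier name -> numbered placeholder
    values = []
    for token in tokens:
        if isinstance(token, javalang.tokenizer.Identifier):
            if token.value not in mapping:
                mapping[token.value] = 'var%d' % (len(mapping) + 1)
            values.append(mapping[token.value])
        else:
            values.append(token.value)
    return values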
Context representation: In part I, the context is represented as a sequence of tokens from the entire function. In addition, both the context and the line under test are represented in the same way and fed as one input. This might not be the best way to represent the context of a bug. You can propose a different approach to represent the context of a bug:

You can try to represent the context differently (e.g., use a higher-level abstraction, only use a set of tokens instead of a sequence)
In this project, we use the entire function as context. This provides a lot of information, but it likely also contains a lot of noise (e.g. irrelevant statements). You can try to use a different context (e.g., reduce the context to only consider the basic block surrounding the line under test).
You can try to feed the context and the instance under test as separate inputs (see the sketch after this list).
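A hedged sketch of the last option, feeding the context and the line under test as two separate inputs with the Keras functional API (the input lengths of 1,000 and 100 tokens and the layer sizes are placeholders; it assumes you pad the context and the line separately during preprocessing):

# Hedged sketch: separate inputs for the context and the line under test
import tensorflow as tf

context_in = tf.keras.Input(shape=(1000,), name='context')    # padded context tokens
line_in = tf.keras.Input(shape=(100,), name='line')           # padded line-under-test tokens

embed = tf.keras.layers.Embedding(encoder.vocab_size, 64)     # shared embedding
ctx = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(embed(context_in))
line = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(embed(line_in))

merged = tf.keras.layers.Concatenate()([ctx, line])
merged = tf.keras.layers.Dense(64, activation='relu')(merged)
out = tf.keras.layers.Dense(1, activation='sigmoid')(merged)

model = tf.keras.Model([context_in, line_in], out)
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

The training generator then has to yield ([context_batch, line_batch], label_batch) tuples instead of a single input array.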
Tuning and building deeper models: Deep learning models have many hyper-parameters that can be tuned (e.g., number of training epochs, number and size of layers, dropout rate, learning rate, etc.). Using different hyper-parameters can lead to very different results. One way to improve the results of a classifier is to pick the "best" hyper-parameters by tuning the model.
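A minimal grid-search sketch; build_model(dropout, lr) is a hypothetical helper that returns a compiled model, and the grid values are arbitrary:

# Hedged sketch: small grid search over dropout rate and learning rate
# build_model(dropout, lr) is a hypothetical helper returning a compiled Keras model
best = None
for dropout in [0.3, 0.5]:
    for lr in [1e-3, 1e-4]:
        model = build_model(dropout=dropout, lr=lr)
        his = model.fit_generator(generator=train_gen, epochs=2,
                                  validation_data=valid_gen, verbose=0)
        val_loss = min(his.history['val_loss'])
        if best is None or val_loss < best[0]:
            best = (val_loss, dropout, lr)
print('Best (val_loss, dropout, lr):', best)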

Using different learning methods: Sometimes, learning from one model and one dataset is not enough to achieve good results. There are several possibilities to improve the models:

Use a pre-trained embedding to get a better source code representation. Much work has been done on learning source code representations from very large corpora. Instead of training our embedding layer from our limited training data, you could use a pre-trained embedding (e.g., such as the ones proposed in code2seq) or train your own embedding (e.g., GloVe or Word2Vec) before training the classifier (see the sketch after this list).
It is easier to learn from simple instances first. Curriculum Learning has been proposed to help models learn from easier instances first.
Use ensemble learning. One model might not be enough to learn all buggy lines. Instead of building one single model, a combination of several smaller models (trained with different training data or using different hyper-parameters) might provide better performance.
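A hedged sketch of the pre-trained embedding idea using gensim's Word2Vec. Note the assumptions: gensim is an extra dependency not in requirements.txt, X_train here means the joined-token strings from reshape_instances (before encoding), the vector size of 64 is arbitrary, and the parameter vector_size is called size in older gensim versions:

# Hedged sketch: pre-train Word2Vec on the training token sequences and
# use it to initialize the Keras embedding layer
from gensim.models import Word2Vec
import numpy as np
import tensorflow as tf

sentences = [x.split() for x in X_train]          # token sequences as lists of strings
w2v = Word2Vec(sentences, vector_size=64, window=5, min_count=1, workers=4)

# Align an embedding matrix with the ids assigned by the TokenTextEncoder
embedding_matrix = np.zeros((encoder.vocab_size, 64))
for token in vocabulary_set:
    ids = encoder.encode(token)                   # punctuation tokens may not map cleanly
    if ids and token in w2v.wv:
        embedding_matrix[ids[0]] = w2v.wv[token]

embedding_layer = tf.keras.layers.Embedding(encoder.vocab_size, 64,
                                            weights=[embedding_matrix],
                                            trainable=False)   # or True to fine-tune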
(IV)- Further improvements (competition) - for Bonus
You are also highly encouraged to improve the defect prediction models by using other techniques beyond the ones we recommended or to try to combine all of them to further improve your model.

(Optional) Use GPU for your training
GPUs can drastically accelerate training. In this part, we will guide you through using tensorflow-gpu to train the model.

The server cuda.cs.purdue.edu is equipped with 6 GPUs capable of deep learning, and you should have access to it. However, most of the time its GPUs are occupied by others, which is out of our control. We highly recommend that you consider using a GPU if you have access to one.

Use GPU of cuda.cs.purdue.edu
We have an environment ready to use on this server. Run nvidia-smi to check the availability of GPUs before you start. To run training on this server using a GPU, follow these steps:

module load cuda/10.0
source /homes/cs510/project-3/venv-gpu/bin/activate
python train_and_test.py

You may get an OUT OF MEMORY error if no GPU is available at that time. Since we don't have control over the server, we cannot guarantee your access to the GPUs. You may try at a different time.
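Before launching a long run, you can also check from Python whether TensorFlow sees a GPU (the exact call depends on your TensorFlow version; the one below is the TF 2.x API):

# Quick check that TensorFlow can see a GPU
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))   # empty list means CPU-only
# On older TensorFlow versions, tf.test.is_gpu_available() gives the same information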

Use your own GPU
If you have control over a machine with an Nvidia GPU, you may use tensorflow-gpu to accelerate your training (the speedup varies depending on the model).

Prerequisites

Python3
CUDA Toolkit 10.0
cuDNN (any version compatible with CUDA 10.0)
Once you have met the prerequisites, you can create a virtualenv and use the provided requirements-gpu.txt to set up your environment.

python3 -m venv path/to/venv
source path/to/venv/bin/activate
pip install --upgrade pip
pip install -r requirements-gpu.txt

Then, you should be ready to train your model on a large dataset faster.
