Find the copied assignment OR plagiarism finder OR find matching subsequences

4 min readSep 25, 2021

We are given a set of documents. Each document is submitted by a different individual. However, we suspect that some individuals may have copied from others. Given a plagiarized submitted document, we want to identify the number of documents with which there is a potential match. We have converted each document into a set of tokens based on their content. As mentioned previously, the students could have added dummy statements between the copied content to avoid identification. We’ll have to match the tokens of two students while taking into account that there can be dummy tokens that might not match. A potential match can occur if one token results in the subsequence of the other token. It is not a guarantee that every match is plagiarized content. In this scenario, we’ll discard the matched tokens that have a length less than two.

We’ll be provided with a string, plagiarised, and an array, students. The plagiarised will contain the tokens against which we’ll match the code samples present in the students array. We have to return the number of possible students in a class the plagiarised content may have been copied from.

Solution

The size of the plagiarised variable can be very large considering the amount of code converted to tokens. We’ll try to traverse this string only once to efficiently identify plagiarism. We can create a dictionary, called waitingList, for code tokens in students. students is an array containing tokenized submitted documents by different students. For example, four students submitted different documents that have been tokenized to a, bb, acd, and ace, respectively. In this scenario, ace may have been copied from a, with an addition of ce to try to deceive the plagiarism detector.

Initially, we’ll group the tokens in students with their starting character. For example, if we have students = ['a', 'bb', 'acd', 'ace'], we can group them as 'a' : ('a', 'acd', 'ace'), 'b' : ('bb'). The words are grouped by what letter they are currently waiting for to appear in plagiarised as we traverse through it. While traversing over plagiarised, we’ll keep modifying our waitingList depending upon the next waiting letter. For example, consider a string like plagiarised = 'abcde':

waitingList = 'a' : ('a', 'acd', 'ace'), 'b' : ('bb') at beginning;

waitingList = 'c' : ('cd', 'ce'), 'b' : ('bb'), None : ('a') after plagiarised[i] = 'a';

waitingList = 'c' : ('cd', 'ce'), 'b' : ('b'), None : ('a') after plagiarised[i] = 'b';

waitingList = 'd' : ('d'), 'e' : ('e'), 'b' : ('b'), None : ('a') after plagiarised[i] = 'c';

waitingList = 'e' : ('e'), 'b' : ('b'), None : ('a', 'acd') after plagiarised[i] = 'd';

waitingList = 'b' : ('b'), None : ('a', 'acd', 'ace') after plagiarised[i] = 'e';

We will keep track of all the code tokens in the students array simultaneously. The dictionary structure tracks the progress of how much each word in students matches with plagiarised. We update the progress of each word in students whose letter occurs in plagiarised and store the modified arrays. If we use the last letter of a word, we add it as a value against the None key. Finally, comparing the length of the array against the None key will give us the number of possible matches.

The following illustration will clarify this process further.