Explain in detail how Java implements FP Growth algorithm

  • 2021-09-24 22:36:30
  • OfStack

Java Implementation of FP-Growth Algorithm

This article focuses on the realization of 1. Two scans are required to build the FP tree

First scan

The first scan filters out all items that do not meet the minimum support; Items that meet the minimum support are sorted in descending order of global support.

With this requirement, the possible difficulty is how to sort item in each transaction according to global support.

My realization ideas

Scan the original data set and save it in the 2D list sourceData Maintain 1 Table so that it holds the global support TotalSup of each item Items below threshold minSup are filtered out in Table Convert Table to List and sort them in descending order of TotalSup Create a new 2D list freqSource, which retains the frequent items in sourceData and sorts each transaction in descending order of global support

Code


/**
     *  Scan the original data set , Generate transaction set 
     * @param path  Dataset path 
     * @throws IOException
     */

    private void scanDataSet(String path) throws IOException {
        if(path.equals("")){
            path = filePath;
        }
        FileReader fr = new FileReader(path);
        BufferedReader bufferedReader = new BufferedReader(fr);
        String str;
//        int maxLength = 0;
        while ( (str = bufferedReader.readLine())!=null){
            ArrayList<Integer> transaction = new ArrayList<>();
            String[] tempEntry ;
            tempEntry = str.split(" ");
            for(int i =0;i< tempEntry.length;i++){
                if(!tempEntry[i].equals("")){
                    int itemValue = Integer.parseInt(tempEntry[i]);
                    transaction.add(itemValue);
                    if(!similarSingleItemLinkedListHeadsTable.containsKey(itemValue)){
                        similarSingleItemLinkedListHeadsTable.put(itemValue, new SimilarSingleItemLinkedListHead(itemValue,null,1));
                    }else{
                        // Set the global support of the item +1
                        similarSingleItemLinkedListHeadsTable.get(itemValue).addSupTotal();
                    }
                }
            }
//            if(tempEntry.length>maxLength){
//                maxLength = tempEntry.length;
//            }

            sourceDataSet.add(transaction);

        }
//        System.out.println(maxLength);
        deleteNonFreqInSSILLHTAndSort();
        deleteNonFreqInSDSAndSort();
        bufferedReader.close();
        fr.close();
    }
        /**
     *  Remove the list of similar items (similarSingleItemLinkedListHeadsTable) Infrequent items of , And according to global support for similarSingleItemLinkedListHeads Descending sort 
     */
    private void deleteNonFreqInSSILLHTAndSort() {
        Hashtable<Integer,SimilarSingleItemLinkedListHead> copyOfSSILLHT =
                (Hashtable<Integer, SimilarSingleItemLinkedListHead>) similarSingleItemLinkedListHeadsTable.clone();
        Set<Integer> keySet = copyOfSSILLHT.keySet();
        // Delete infrequent items 
        for(int key: keySet){
            if(similarSingleItemLinkedListHeadsTable.get(key).getSupTotal()<minSupCnt){// Below the support threshold 
                similarSingleItemLinkedListHeadsTable.remove(key);
            }
        }
        // Sort by Global Support 
        similarSingleItemLinkedListHeadList = new ArrayList<>(similarSingleItemLinkedListHeadsTable.values());
        similarSingleItemLinkedListHeadList.sort(new Comparator<SimilarSingleItemLinkedListHead>() {
            @Override
            public int compare(SimilarSingleItemLinkedListHead o1, SimilarSingleItemLinkedListHead o2) {
                return o2.getSupTotal() - o1.getSupTotal();
            }
        });

    }
        /**
     *  Remove transaction set (sourceDataSet) Infrequent items of , And for each transaction according to global support item Sort in descending order 
     *  The results are stored in the freqSourceSortedDataSet
     */
    private void deleteNonFreqInSDSAndSort(){
        freqSourceSortedDataSet = (ArrayList<ArrayList<Integer>>) sourceDataSet.clone();
        for(int i =0;i<sourceDataSet.size();i++){
            for(int j = 0;j<sourceDataSet.get(i).size();j++){
                int item = sourceDataSet.get(i).get(j);
                //  Because at this time SSILLHT Items in are frequent items , Just make sure item Whether it exists in it or not , Existence means frequency .
                if(visitSupTotal(item)==-1){
                    // Mark infrequent items as the smallest integer value 
                    freqSourceSortedDataSet.get(i).set(j,Integer.MIN_VALUE);
                }
            }
            // Remove marked items .
            freqSourceSortedDataSet.get(i).removeIf(e->e == Integer.MIN_VALUE);
            insertSort(freqSourceSortedDataSet.get(i));
        }
        freqSourceSortedDataSet.removeIf(e->e.size() == 0);

    }

Second scan

In the second scan, the FP tree is constructed.
Participating in scanning is filtered data. If a data item is encountered for the first time, create the node and add a pointer to the node in headTable; Otherwise, find the node corresponding to the item according to the path, and modify the node information

It is relatively simple here, because there is already filtered and sorted data freqSourceSortedDataSet. We just need to

FP tree and similar necklace list are constructed by traversing every transaction trans of freqSourceSortedDataSet and every item of trans If an item is encountered for the first time, you need to create the node and link it in a similar necklace list. Needless to say, the linked list. The number of children of FP tree is indefinite, so special data structure is needed. I use HashTable here

  /**
     *  Build FP Trees 
     */
    private void buildFPTree(){
        for(ArrayList<Integer>trans:freqSourceSortedDataSet){
            Node curTreeNode = fpTree.root;
            for(int item :trans){
                if(!curTreeNode.children.containsKey(item)){
                    Node node = new Node(item,1);
                    curTreeNode.children.put(item,node);
                    node.father = curTreeNode;
                    buildSimilarSingleItemLinkedList(item,curTreeNode);
                }else{
                    curTreeNode.children.get(item).sup++;
                }
                curTreeNode=curTreeNode.children.get(item);
            }
        }
    }
    /**
     *  Build a list of similar necklaces 
     */
    private void buildSimilarSingleItemLinkedList(int item,Node curTreeNode){
        // Find the item Position in similar necklace list 

        int index = searchForItemInHeadsList(item,
                (ArrayList<SimilarSingleItemLinkedListHead>) similarSingleItemLinkedListHeadList);
        if(similarSingleItemLinkedListHeadList.get(index).next == null){
            similarSingleItemLinkedListHeadList.get(index).next = curTreeNode.children.get(item);
        }else{
            Node visitNode = similarSingleItemLinkedListHeadList.get(index).next;
            while (visitNode.nextSimilar!=null){

                visitNode = visitNode.nextSimilar;
            }
            if(visitNode != curTreeNode.children.get(item))
                visitNode.nextSimilar = curTreeNode.children.get(item);
        }
    }
    /**
     *  In HeadList Search for an item in the 
     * @param item  Items 
     * @param similarSingleItemLinkedListHeads  Head node linked list 
     * @return  Position ,-1 Indicates that it was not found 
     */
    private int searchForItemInHeadsList(int item, ArrayList<SimilarSingleItemLinkedListHead> similarSingleItemLinkedListHeads) {
        for(int i =0;i<similarSingleItemLinkedListHeads.size();i++){
            if(similarSingleItemLinkedListHeads.get(i).getItemValue() == item){
                return i;
            }
        }
        return -1;
    }
    

Mining frequent itemsets

I think this part is the most difficult part to realize. But I speak very quickly about this place at B station or other places (B station doesn't have two videos about this thing, vomit). There are different concepts, such as prefix paths in the Black Book, and conditional schema bases in other places. The following code is implemented according to the prefix path.

Let's stroke our thinking and what we need to do to dig out frequent item sets.

First, you need to traverse back-to-front every 1 item of the list of similar necklaces (this 1 list has been sorted by global support in the first scan).

Recursively perform the following steps for each item:

① Record the prefix path. The method I use is to record all nodes appearing in the prefix path with 1 HashSet.

The support degree of every 1item of the FP tree is recorded. Similar to the previous first scan.

③ According to the recorded support, if item is frequent, the item and the current suffix are frequent itemsets.

Then, according to record, the list of similar necklaces of the FP tree is constructed, and the infrequent items (similar to the first scan) and the FP tree under the current item composition condition are removed. There is no need to re-establish the structure of an FP tree to form a conditional FP tree, because recording prefix paths only needs to access similar entries and parent entries.

The remaining items in the list of similar necklaces are further executed with step ① until there are no items in the list of similar necklaces, which is terminated.


/**
     *  Algorithm execution function 
     * @param minSupCnt  Minimum support count 
     * @param path  File path 
     * @param pT  Threshold of itemset size for output results 
     */
    public void run(int minSupCnt,String path,int pT) throws IOException {
        this.printThreshold = pT;
        this.minSupCnt = minSupCnt;
        scanDataSet(path);
        buildFPTree();
        for(int i = similarSingleItemLinkedListHeadList.size()-1;i>=0;i--){
            genFreqItemSet(similarSingleItemLinkedListHeadList.get(i).getItemValue()
                    ,fpTree,similarSingleItemLinkedListHeadList,new TreeSet<>());
        }
        //genFreqItemSet(14,fpTree,similarSingleItemLinkedListHeadList,new TreeSet<>());
        System.out.println(" Number of frequent itemsets :\t"+cntOfFreqSet);
    }
/**
     *  Generate frequent itemsets 
     * @param last  Last item 
     * @param fPTree  Condition FP Trees 
     * @param fatherSimilarSingleItemLinkedListHeads  Linked list of similar header nodes of parent tree 
     * @param freqItemSet  Frequent itemset 
     */
    private void genFreqItemSet(int last,FPTree fPTree,
                                List<SimilarSingleItemLinkedListHead>fatherSimilarSingleItemLinkedListHeads,TreeSet<Integer>freqItemSet) {

        FPTree conditionalFPTree = new FPTree();
        List<SimilarSingleItemLinkedListHead>similarSingleItemLinkedListHeads = new ArrayList<>();

        TreeSet<Integer>localFreqItemSet = (TreeSet<Integer>) freqItemSet.clone();
        int index ;
        index = searchForItemInHeadsList(last,
                (ArrayList<SimilarSingleItemLinkedListHead>) fatherSimilarSingleItemLinkedListHeads);

        Node firstNode = fatherSimilarSingleItemLinkedListHeads.get(index).next;
        HashSet<Node>record = new HashSet<>();  // Used to record the nodes appearing on the prefix path 
        // Record prefix path 
        if(firstNode!=null){
            record.add(firstNode);
            Node nodeToVisitFather = firstNode;
            Node nodeToVisitSimilar = firstNode;
            while (nodeToVisitSimilar!=null){
                nodeToVisitSimilar.supInCFP = nodeToVisitSimilar.sup;
                nodeToVisitFather = nodeToVisitSimilar;
                while (nodeToVisitFather!=null){
                    //  Calculation supInCFT
                    if(nodeToVisitFather!=nodeToVisitSimilar)
                        nodeToVisitFather.supInCFP += nodeToVisitSimilar.supInCFP;
                    record.add(nodeToVisitFather);
                    nodeToVisitFather = nodeToVisitFather.father;
                }
                nodeToVisitSimilar = nodeToVisitSimilar.nextSimilar;
            }

            // Record support in subtree 
            Hashtable<Integer,Integer> supRecord = new Hashtable<>();
            record.forEach(new Consumer<Node>() {
                @Override
                public void accept(Node node) {
                    int item = node.item;
                    if(item == -1 ){    // Root node 
                        return;
                    }
                    if(supRecord.containsKey(item)){
                        supRecord.put(item,supRecord.get(item)+ node.supInCFP);
                    }else{
                        supRecord.put(item,node.supInCFP);
                    }

                }
            });
            // Output result 
            if(supRecord.get(last)>=minSupCnt){
                localFreqItemSet.add(last);
                if(localFreqItemSet.size()>=printThreshold && !result.contains(localFreqItemSet)){
                    cntOfFreqSet++;
//                    for(int i = localFreqItemSet.size()-1;i>=0;i--){
//                        System.out.print(localFreqItemSet.get(i)+" ");
//                    }
                    localFreqItemSet.forEach(new Consumer<Integer>() {
                        @Override
                        public void accept(Integer integer) {
                            System.out.print(integer+" ");
                        }
                    });
                    result.add(localFreqItemSet);

                    System.out.println("");
                }
            }

            // Build a list of similar necklaces 
            record.forEach(new Consumer<Node>() {
                @Override
                public void accept(Node node) {
                    if(node.item == -1){    // Root node 
                        Node visitNode = node;
                        buildConditionalFPTree(conditionalFPTree.root, visitNode,record,
                                (ArrayList<SimilarSingleItemLinkedListHead>) similarSingleItemLinkedListHeads,supRecord,last);
                    }
                }
            });
            // Sort in descending order of support 
            similarSingleItemLinkedListHeads.sort(new Comparator<SimilarSingleItemLinkedListHead>() {
                @Override
                public int compare(SimilarSingleItemLinkedListHead o1, SimilarSingleItemLinkedListHead o2) {
                    return o2.getSupTotal() - o1.getSupTotal();
                }
            });

            if(similarSingleItemLinkedListHeads.size()>=1){
                // Recursive search for frequent items 
                for(int i =similarSingleItemLinkedListHeads.size()-1;i>=0;i--){
                    genFreqItemSet(similarSingleItemLinkedListHeads.get(i).getItemValue(),
                            conditionalFPTree,similarSingleItemLinkedListHeads,localFreqItemSet);
                    // similarSingleItemLinkedListHeads.remove(i);
                }
            }
        }
    }
/**
     *  Recursive build condition FP Trees 
     * @param rootNode  Establish a condition down with this node as the root FP Trees 
     * @param originalNode  rootNode Corresponding to nodes in the original tree 
     * @param record     Prefix path 
     * @param similarSingleItemLinkedListHeads   Similar item header linked list 
     * @param supRecord  Record of support count 
     * @param last  Last item 
     */
    private void buildConditionalFPTree(Node rootNode,Node originalNode,HashSet<Node>record
            ,ArrayList<SimilarSingleItemLinkedListHead>similarSingleItemLinkedListHeads,Hashtable<Integer,Integer>supRecord,int last){
        if(originalNode.children!=null){
            for(int key:originalNode.children.keySet()){    // Traversal originalNode All son nodes of , Check whether it is in the prefix path 
                Node tempNode = originalNode.children.get(key);
                if(record.contains(tempNode)){
                    Node addedNode = new Node(tempNode.item, tempNode.supInCFP);
                    if(last == key){    // Removal last All nodes of the 
                        tempNode.supInCFP = 0;
                        continue;
                    }
                    if(supRecord.get(key)>=minSupCnt){
                        //addedNode  Copy  tempNode Attributes other than son nodes 
                        addedNode.supInCFP = tempNode.supInCFP;
                        rootNode.children.put(tempNode.item, addedNode);
                        addedNode.father = rootNode;
                        // Build a table of similar items 
                        int i = searchForItemInHeadsList(tempNode.item,similarSingleItemLinkedListHeads);
                        if(i==-1){
                            similarSingleItemLinkedListHeads.add(new SimilarSingleItemLinkedListHead(key,addedNode, addedNode.supInCFP));
                        }else{
                            similarSingleItemLinkedListHeads.get(i).setSupTotal(similarSingleItemLinkedListHeads.get(i).getSupTotal()+addedNode.supInCFP);
                            Node visitNode = similarSingleItemLinkedListHeads.get(i).next;
                             while (visitNode.nextSimilar!=null){
                                visitNode = visitNode.nextSimilar;
                            }
                            if(visitNode!=addedNode){
                                visitNode.nextSimilar= addedNode;
                            }
                        }
                        buildConditionalFPTree(addedNode,originalNode.children.get(key),record,similarSingleItemLinkedListHeads,supRecord,last);
                        addedNode.supInCFP = 0; // Will supInCFP Reset to 0;
                    }else{
                        buildConditionalFPTree(rootNode,originalNode.children.get(key),record,similarSingleItemLinkedListHeads,supRecord,last);
                    }

                }
            }
        }
    }

Complete code

FP-Growth


Related articles: