How may I un-encode the features from a decision tree to see the important features?












I have a dataset I am working with. I am converting its categorical features to numerical features for my decision tree. The conversion is applied to the entire data frame with the following lines:



from sklearn.preprocessing import LabelEncoder as LE

le = LE()
df = df.apply(le.fit_transform)  # applies fit_transform to each column in turn


I later take this data and split it into training and testing data with the following:



from sklearn.model_selection import train_test_split as tts

target = ['label']
df_y = df['label']
df_x = df.drop(target, axis=1)

# Split into training and testing data
train_x, test_x, train_y, test_y = tts(df_x, df_y, test_size=0.3, random_state=42)


Then I pass it to a method that trains a decision tree:



from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

def Decision_Tree_Classifier(train_x, train_y, test_x, test_y, le):
    print " - Candidate: Decision Tree Classifier"
    dec_tree_classifier = DecisionTreeClassifier(random_state=0)  # Instantiate
    dec_tree_classifier.fit(train_x, train_y)                     # Fit
    accuracy = dec_tree_classifier.score(test_x, test_y)          # Accuracy
    predicted = dec_tree_classifier.predict(test_x)
    mse = mean_squared_error(test_y, predicted)

    tree_feat = list(le.inverse_transform(dec_tree_classifier.tree_.feature))
    print "Tree Features:"
    print tree_feat
    print "Tree Thresholds:"
    print dec_tree_classifier.tree_.threshold

    scores = cross_val_score(dec_tree_classifier, test_x, test_y.values.ravel(), cv=10)
    return (accuracy, mse, scores.mean(), scores.std())


In the above method, I am passing in the LabelEncoder object that was originally used to encode the data frame. The line

    tree_feat = list(le.inverse_transform(dec_tree_classifier.tree_.feature))

is meant to convert the features back to their original categorical representation, but I keep getting this stack trace:



  File "<ipython-input-6-c2005f8661bc>", line 1, in <module>
runfile('main.py', wdir='/Users/mydir)

File "/Users/me/anaconda2/lib/python2.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 668, in runfile
execfile(filename, namespace)

File "/Users/me/anaconda2/lib/python2.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 100, in execfile
builtins.execfile(filename, *where)

File "/Users/me/mydir/main.py", line 125, in <module>
main() # Run main routine

File "candidates.py", line 175, in get_baseline
dec_tre_acc = Decision_Tree_Classifier(train_x, train_y, test_x, test_y, le)

File "candidates.py", line 40, in Decision_Tree_Classifier
tree_feat = list(le.inverse_transform(dec_tree_classifier.tree_.feature))

File "/Users/me/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 281, in inverse_transform
"y contains previously unseen labels: %s" % str(diff))

ValueError: y contains previously unseen labels: [-2]


What do I need to change to be able to look at the actual features themselves?










Tags: python, scikit-learn, decision-tree, encoder






asked Nov 27 '18 at 20:47 by tushariyer
























1 Answer
































When you do this:

    df = df.apply(le.fit_transform)

you are using a single LabelEncoder instance for all of your columns. Each call to fit() or fit_transform() makes le forget whatever it learned before and fit only the current data. So the le you end up with stores the encoding of the last column it saw, not of all columns.
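You can see this with a small toy illustration (the data here is made up, not from the question); only the last column's classes survive in le:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.DataFrame({"color": ["red", "blue"], "size": ["S", "M"]})
    le = LabelEncoder()
    df_enc = df.apply(le.fit_transform)   # le is refit on every column in turn
    print(le.classes_)                    # ['M' 'S'] -- only the last column ('size') remains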



          There are multiple ways to solve this:





1. You can maintain multiple LabelEncoder objects (one for each column). See this excellent answer:

   • Label encoding across multiple columns in scikit-learn

       from collections import defaultdict
       from sklearn.preprocessing import LabelEncoder

       # one LabelEncoder per column, created on first use and keyed by column name
       d = defaultdict(LabelEncoder)
       df = df.apply(lambda x: d[x.name].fit_transform(x))





2. If you want a single object that handles all columns, you can use OrdinalEncoder, available in recent versions of scikit-learn (0.20+); see the short sketch after this list.

       from sklearn.preprocessing import OrdinalEncoder

       enc = OrdinalEncoder()
       df = enc.fit_transform(df)   # note: returns a NumPy array, not a DataFrame
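For reference, OrdinalEncoder keeps one array of categories per column and can reverse the whole frame in one call; a minimal sketch (variable names here are illustrative, not from the question):

    from sklearn.preprocessing import OrdinalEncoder

    enc = OrdinalEncoder()
    encoded = enc.fit_transform(df)           # 2-D NumPy array of float codes
    print(enc.categories_)                    # list with one category array per column
    decoded = enc.inverse_transform(encoded)  # back to the original categorical values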



Even so, the error will not go away, because tree_.feature does not hold feature values at all; for each node it holds the index of the column in df that was used for splitting at that node. So if you have 3 features (columns) in the data (regardless of the values in those columns), tree_.feature can only contain:

• 0, 1, 2, -2

• -2 is a special placeholder denoting a leaf node, where no feature is used for splitting.

tree_.threshold does contain values on the scale of your data, but they are floats, so you will have to map them back yourself according to how the categories were converted to numbers.
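To answer the original question directly, here is a minimal, hypothetical sketch (describe_splits is not part of the answer above) that maps each tree_.feature index back to a column name and uses the per-column LabelEncoders d from option 1 to decode the split thresholds; it assumes df_x and the fitted dec_tree_classifier from the question:

    import numpy as np

    def describe_splits(clf, df_x, d):
        # clf: fitted DecisionTreeClassifier, df_x: feature DataFrame,
        # d: defaultdict of per-column LabelEncoders from option 1
        tree = clf.tree_
        for node in range(tree.node_count):
            feat_idx = tree.feature[node]
            if feat_idx < 0:                      # -2 marks a leaf node: no split here
                continue
            col = df_x.columns[feat_idx]          # column name, not an encoded value
            thr = tree.threshold[node]
            # the threshold lies between two encoded category codes; decode both sides
            cats = d[col].inverse_transform([int(np.floor(thr)), int(np.ceil(thr))])
            print("node %d splits on %r at %.2f (between %r and %r)"
                  % (node, col, thr, cats[0], cats[1]))

    describe_splits(dec_tree_classifier, df_x, d)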



See this example for a detailed explanation of the tree structure:




          • https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html






answered Nov 28 '18 at 7:35 by Vivek Kumar, edited Nov 28 '18 at 8:02