# Python Text Classification

This sample shows how to:
- Build a dataset with contents retrieved from **CosmosDB**
- Train a text classification model in Python using **scikit-learn** (sklearn) and **spaCy**
- Use a model stored in **Azure Blob Storage** to get a prediction

The modules *model_helpers*, *text_processing* and *azure_storage_helpers* in this repo can be used independently.

## Run the sample

1. Install the following libraries, if you don't already have them, by running `pip install {name of library}`:
    - spacy
    - nltk
    - azure-cosmos
    - azure-storage-blob
    - pandas
    - scikit-learn
    
2. Make sure you have a CosmosDB database set up with data in it

    A sample dataset is not available yet, but this sample expects documents in the following format:
    ```
    {...
     pages:[{
            ...
            sections:[{
                      ...
                      text:'',
                      label:''
                      },
                      ...
                      ]
           },
           ...
           ]
    }
    ```
    You can of course use your own data; in that case, adjust the dataset-building code in *model_train.py* to match your format.
  
3. Make sure you have an Azure Storage account set up

4. In *model_train.py*, replace the values in **cosmosConfig** and **blobConfig** with your own

5. Run `python .\model_train.py`

6. In *get_predict.py*, replace the values in **blobConfig** with your own

7. In *get_predict.py*, replace the text to get a prediction for

8. Run `python .\get_predict.py`
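
To illustrate the dataset-building step, here is a minimal sketch that flattens documents in the nested format shown in step 2 into `(text, label)` pairs. The function names `iter_labeled_sections` and `build_dataset` and the sample documents are hypothetical, chosen for illustration; the actual code in *model_train.py* may differ.

```python
def iter_labeled_sections(document):
    """Yield (text, label) pairs from one document in the nested format."""
    for page in document.get("pages", []):
        for section in page.get("sections", []):
            text = section.get("text", "")
            label = section.get("label", "")
            # Skip sections that are missing either the text or the label.
            if text and label:
                yield text, label


def build_dataset(documents):
    """Flatten a list of documents into a list of (text, label) rows."""
    rows = []
    for document in documents:
        rows.extend(iter_labeled_sections(document))
    return rows


# Hypothetical documents, standing in for the results of a CosmosDB query.
sample_documents = [
    {
        "id": "doc-1",
        "pages": [
            {"sections": [
                {"text": "Total amount due", "label": "billing"},
                {"text": "", "label": "billing"},  # empty text, skipped
            ]},
            {"sections": [
                {"text": "Reset your password", "label": "support"},
            ]},
        ],
    },
    {"id": "doc-2", "pages": []},  # no pages, contributes no rows
]

dataset = build_dataset(sample_documents)
print(dataset)
```

The resulting rows can be loaded into pandas with `pandas.DataFrame(dataset, columns=["text", "label"])` if you prefer to work with a data frame.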


## Use the modules independently

- If you need to upload or retrieve data from CosmosDB or Azure Blob Storage in Python, *azure_storage_helpers.py* is all you need

- If you want to perform text processing (normalize text and remove stop words), you can use *text_processing.py*

- If you want to train a classification model using sklearn, *model_helpers.py* is what you're looking for
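
As a rough illustration of how text processing and model training fit together, the sketch below normalizes text, removes stop words, and trains a small sklearn classifier. It does not use the functions from *text_processing.py* or *model_helpers.py*: the `normalize` helper, the tiny `STOP_WORDS` set (a stand-in for the spaCy or nltk stop word lists), the training examples, and the choice of TF-IDF features with logistic regression are all assumptions made for this example.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stop word list; a real one would come from spaCy or nltk.
STOP_WORDS = {"a", "an", "and", "the", "is", "my", "to", "of", "for",
              "your", "i", "please"}


def normalize(text):
    """Lowercase the text, keep alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(token for token in tokens if token not in STOP_WORDS)


# Hypothetical training data in place of the CosmosDB dataset.
train_texts = [
    "The invoice for my order is due",
    "Payment of the invoice failed",
    "Refund the payment to my card",
    "I forgot my password",
    "Reset the password for my account",
    "The login to my account is locked",
]
train_labels = ["billing", "billing", "billing",
                "support", "support", "support"]

# The vectorizer calls normalize() on each text before tokenizing it.
model = make_pipeline(
    TfidfVectorizer(preprocessor=normalize),
    LogisticRegression(),
)
model.fit(train_texts, train_labels)

prediction = model.predict(["Please reset my password"])[0]
print(prediction)
```

Wrapping the vectorizer and the classifier in one pipeline means the same normalization is applied at training and prediction time, so the fitted pipeline can be serialized and stored as a single object.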