Analyzing over 100,000 complaints from the Silicon Valley of India.
text_mining
machine_learning
Author
Nathan States
Published
January 22, 2023
Results
Over 80% of complaints filed were related to street lights not working, road maintenance, or garbage collection issues.
Of the four classification models, the linear support vector classifier performed the best, recording 88.773% accuracy. The top three models performed similarly, all falling within about three percentage points of one another.
Improvements were made by expanding the stop word list and combining smaller categories into larger ones. Using LinearSVC, these changes led to a 3.214 percentage point increase, ultimately recording an accuracy of 91.987%.
Further improvements can be made by adding to the stop word list, changing the contents of the “Others” category, and adjusting the downsampling of the model.
Background
Text data can be difficult to analyze in large amounts, but raw text is invaluable in numerous ways. Using simple Python libraries, modern machine learning models can parse thousands of rows in seconds and be put to a variety of uses. One of the most common is classification, or categorizing text into different groups.
The Bruhat Bengaluru Mahanagara Palike (BBMP) - an administrative body that oversees city development in Bengaluru, the largest tech city in India - created a web application that allows citizens to file grievances with the city. From February 8th, 2020 to February 21st, 2021, a total of 105,956 complaints were filed to BBMP, or about 280 a day. Exploring this data not only provides insight into the most common problems facing the city (or at least the complaints most likely to be submitted), but also presents an opportunity to quantify and categorize them.
In the dataset, complaints have been manually categorized by the administrators who oversee the app at BBMP, but doing this by hand is extremely inefficient. Helpfully, though, the developers have already created categories that they felt best sorted the data, which means that, assuming complaints don't change substantially in the future, we can train a machine learning model to perform this task automatically. This could save hours of manual work.
Because the categories are already defined, this will be a supervised classification model.
Data Wrangling
Data wrangling and EDA done using R.
First, we import the data and the libraries we will be using.
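The libraries include tidyverse for wrangling, tidytext for tokenization, and echarts4r for the interactive charts.
Show code
# Load Libraries
library(tidyverse)
library(tidymodels)
library(tidytext)
library(echarts4r)
library(here)
library(lubridate)
library(wordcloud2)
library(reticulate)

# Set Directory
here::set_here()

# Import Data
grievances <- readr::read_csv("bengaluru-grievances.csv")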
The admins at BBMP keep their data neat and tidy, so there aren't many problems to fix. There are a couple of things to consider, however.
The first issue is that some descriptions are extremely short, making it nearly impossible to classify them accurately. We can require a minimum number of characters per complaint, though the appropriate cutoff is debatable. Ideally, we want to remove descriptions too short to be properly categorized without excluding too much of the data.
Using str_length from the stringr package, we see 5,392 complaints contained fewer than 12 characters, meaning removing them would still preserve 95% of the original data. Using filter from dplyr, we can keep all complaints containing more than 12 characters.
Show code
# Counting number of rows
grievances %>%
  filter(str_length(description) < 12) %>%
  count()
# A tibble: 1 x 1
n
<int>
1 5392
Show code
# Removing those rows
grievances <- grievances %>%
  filter(str_length(description) > 12)
The other consideration is whether the existing categories accurately reflect the data. It's possible that some similar categories would be better combined into one, and likewise that some single categories should be split into several. These changes might not only describe the data better, but also improve accuracy in the long run.
For now, we will leave the original categories intact and proceed, but future models may benefit from this step.
Exploratory Data Analysis
Interactive charts created using echarts4r.
Number of Grievances By Day
On most days, grievances varied between 200 and 400, with some spikes from September onwards and a massive single-day peak of 742 in March. On average, 280 complaints were filed each day.
Because complaints don’t seem to fluctuate significantly, we can likely assume that the day the grievance was filed isn’t indicative of its contents.
Grievances by Category
81.89% of total grievances were categorized as electrical, solid waste or garbage related, or road maintenance. The next three largest categories make up an additional 11.09%, meaning the remaining categories have fewer than 1,350 occurrences combined.
Note that this is not due to a lack of categories; in fact, there are a total of 20 categories in the data.
Show code
length(unique(grievances$category))
[1] 20
Large imbalances like this are important to consider when building machine learning models. Algorithms are programmed to achieve the highest overall accuracy regardless of our original purpose, and they tend to over-predict larger categories. If a model discovers it can restrict itself to three options while still recording 80%+ accuracy, it will almost always do so.
This means, though, that machine learning models will tend to ignore minor categories - using our current data as an example, predicting a category outside the top three will be wrong at least 81.89% of the time - so this will need to be addressed when creating the models.
There is a question as to whether certain smaller categories should exist at all, though.
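Show code
# Categories with fewer than 100 complaints
grievances %>%
  group_by(category) %>%
  summarise(total = n()) %>%
  filter(total < 100)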
# A tibble: 5 x 2
category total
<chr> <int>
1 Education 20
2 Estate 75
3 Markets 38
4 Optical Fiber Cables (OFC) 62
5 Welfare Schemes 28
Five categories have fewer than 100 complaints total, including two with fewer than thirty. That is far too few complaints to reliably model, especially considering we haven't split the data yet.
Grievances by Subcategory
While the model being built only predicts the main categories, viewing subcategories shows that over 35% of total complaints related specifically to street lights not working, comprising almost all of the electrical category. Solid waste problems are divided into two subcategories: garbage vehicles not arriving and "Garbage dump" (whatever that means). Meanwhile, road maintenance is divided into three subcategories: potholes, road side drains, and debris removal.
Most Common Complaint Words
Seeing as over a third of the data was subcategorized as street lights not working, it's no surprise that those terms are the most common words across all the complaints.
Missing from the chart is the word not, but that's because we removed all stop words from the data. Put simply, stop words are words that don't add anything to the process of categorizing text. They usually include words like he, she, there, they, I, and so on, which is why they don't appear in the chart.
The stop words included in the tidytext package (and the ones used in the function above) are meant to apply universally, but if you look closely at the word cloud, there are several words that almost certainly don’t apply to our problem. These include terms like please, kindly, last request, sir, and individual numbers like 1, 2, and 3. While there might exist some incidental correlation between some of these words and their respective categories (perhaps citizens filing animal control complaints are nicer on average, so the term please could be used to identify those complaints more accurately, for example), it’s likely this will just throw off our model’s accuracy in the long run.
From the EDA, the primary factors that should be considered when building the models are:
Account for imbalances in number of occurrences per category.
Reduce the number of categories by combining them into existing ones.
Add additional stop words.
MLE Models
Preparing the Data
We'll opt to use Python for creating the MLE models, as Python's libraries are generally faster and more developed than their R counterparts. Because model training is computationally intensive, the code was run locally and is presented here for demonstration.
To start, we’ll import the necessary libraries and load the data using pandas. The models here will be built using sklearn.
Show code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn import metrics

df = pd.read_csv("bengaluru-grievances.csv")
One of the imports here is TfidfVectorizer, the vectorizer that will be used to build features for the models. I'll explain why I specifically chose it later on.
Here, I quickly re-apply the earlier data wrangling step, keeping only complaints longer than 12 characters, this time using Python syntax.
Show code
# Reduce Character Limit
character_limit = (df["description"].str.len() > 12)
df = df.loc[character_limit]
A model based on over 100,000 observations is extremely hardware intensive, and it causes my laptop to overheat a lot. For practical purposes, we’ll cut down on our data by choosing 15,000 rows at random.
Show code
df = df.sample(15000, random_state=1).copy()
If we wanted to get another random 15,000 rows, we could change the random_state = 1 argument to any other number, like 42, 671, or 7.
With no other changes to be made, we can begin creating the models.
Text Preprocessing
There are several ways to prepare the data, but the simplest is to create a new column - we'll call it category_id - that encodes each existing category as an integer (Electrical = 0, Road Engineering = 1, etc.). This is necessary for getting our model to run properly, as sklearn will not treat raw strings as factors.
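Show code
# Encode each category as an integer ID
df["category_id"] = df["category"].factorize()[0]

# Lookup tables between category names and IDs (used again later)
category_id_df = df[["category", "category_id"]].drop_duplicates()
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[["category_id", "category"]].values)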
Next, the description column (which stores the text for grievances) needs to be converted to vectors using a chosen algorithm. The algorithm chosen here is Term Frequency - Inverse Document Frequency (TF-IDF), which is the product of \(TF\) and \(IDF\) scores. This is the TfidfVectorizer function that we imported earlier.
It’s useful to present these terms mathematically.
Term Frequency: \[ TF = \frac{\text{number of times the term appears in the description}}{\text{total number of words in the description}} \]
Inverse Document Frequency: \[ IDF = \log\left(\frac{\text{number of rows in the data}}{\text{number of rows in which the term appears}}\right) \]
TF-IDF: \[ TF\text{-}IDF = TF \times IDF \]
The reason for choosing this algorithm was for the \(IDF\) component, which downsamples words that appear frequently across all complaints, while adding extra weight to terms that appear less often.
Analyzing it mathematically: as the denominator of the \(IDF\) term grows - that is, as a word appears in more and more complaints - the \(IDF\) value shrinks toward zero. Let \(N\) be the total number of rows in the data and \(df_t\) the number of rows containing a chosen term \(t\). If a word appears in every single complaint, then \(IDF = log(\frac{N}{df_t}) = log(\frac{N}{N}) = log(1) = 0\), meaning that when calculating \(TF\text{-}IDF\), that word carries no weight at all when classifying complaints.
Now, why do this? As discussed in the EDA section, almost 82% of all complaints fell into just three categories. Classification models tend to stick to a few dominant categories and struggle to identify minor ones. While this may record higher accuracy on average, it means minor categories will rarely be predicted at all, and if the skew is severe enough, it can even lower overall accuracy. By down-weighting ubiquitous terms, we hope to reduce this issue.
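To make the effect concrete, here is a small worked example applying the plain IDF formula above to a hypothetical three-complaint toy corpus (not the actual data); note that sklearn's TfidfVectorizer additionally applies smoothing by default, so its exact values differ slightly.
Show code
import math

# Hypothetical toy corpus for illustration -- not the BBMP data
docs = [
    "street light not working",
    "street light broken near park",
    "garbage not collected",
]

def idf(term, docs):
    # Number of rows (documents) the term appears in
    df = sum(term in d.split() for d in docs)
    return math.log(len(docs) / df)

print(round(idf("street", docs), 3))   # in 2 of 3 complaints -> log(3/2) ~ 0.405
print(round(idf("garbage", docs), 3))  # in 1 of 3 complaints -> log(3/1) ~ 1.099
# A term appearing in every complaint would get log(3/3) = 0, i.e. zero weight.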
We first set up our TfidfVectorizer and assign it to a variable, tfidf.
Show code
tfidf = TfidfVectorizer(
    sublinear_tf=True,     # Set term frequency to logarithmic scale
    min_df=5,              # Remove terms that appear fewer than 'min_df' times
    ngram_range=(1, 2),    # Unigrams and bigrams are considered
    stop_words='english'   # Use common English stop words
)
From here, we can begin building our models, but before doing so, let’s see what the most common terms were for each category.
To do so, we fit the vectorizer, compute chi-squared statistics between the TF-IDF features and each category, and sort the feature names (retrieved with tfidf.get_feature_names_out()) by how strongly they correlate with that category. We then split them into two separate lists for unigrams and bigrams (fancy words for "one word" and "two words") and print the \(N = 3\) most correlated terms from each list to the console. We wrap all of this in a for loop that steps through each category.
Show code
# Defaults
features = tfidf.fit_transform(df.description).toarray()
labels = df.category_id
N = 3

# For Loop
for category, category_id in sorted(category_to_id.items()):
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("\n==> %s:" % (category))  # Space for formatting
    print("  * Most Correlated Unigrams are: %s" % (', '.join(unigrams[-N:])))
    print("  * Most Correlated Bigrams are: %s" % (', '.join(bigrams[-N:])))
As this part was performed offline, here is the output in screenshots.
The results are largely what we would expect, though there are some things to note.
Some common phrases that appear for certain categories seemingly have nothing to do with them, such as "77", "kindly", and "plz look", which rank among the most correlated terms for both "Education" and "Welfare Schemes." Remember, these were the categories with fewer than 100 observations total. When we sampled 15,000 random rows, these categories shrank even further, which is probably why these nonsense phrases appear.
Building the Models
To begin, we first split the data into a 75:25 training and test split. The model will “learn” how to classify grievances based on the training data, and then it will “test” its accuracy on the remaining 25%.
Show code
# We define them here as independent variables
X = df["description"]
y = df["category"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
There are several different models to choose from, but it’s hard to know which will perform best before actually building them. That’s why we’ll test several models simultaneously by storing them in a list and looping through each model.
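The four candidates are a random forest, a linear support vector classifier, multinomial naive Bayes, and logistic regression.
Show code
models = [
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]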
After, we apply each model to the training data and record the results. A single accuracy score is partly due to chance, so we'll use five-fold cross-validation and take the mean across folds to get a more balanced result. We store the results in a pandas dataframe for analysis.
Show code
# Copy and pasted from before
features = tfidf.fit_transform(df.description).toarray()
labels = df.category_id

CV = 5  # Number of cross-validations
cv_df = pd.DataFrame(index=range(CV * len(models)))  # CV dataframe
entries = []  # Array for storing model results

for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))

cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
Results
Because we used five-fold cross-validation, we have a total of 20 accuracy results - five for each model. We grab the mean accuracy and standard deviation for each model and store them in a dataframe.
The top three models performed similarly, all falling within about 3.1 percentage points of one another. The Linear Support Vector Classifier performed the best among the three, while the Random Forest performed atrociously. Standard deviation among the top three remained fairly low, especially for multinomial naive Bayes.
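Show code
mean_accuracy = cv_df.groupby('model_name').accuracy.mean()
std_accuracy = cv_df.groupby('model_name').accuracy.std()
accuracy = pd.concat([mean_accuracy, std_accuracy], axis=1, ignore_index=True)
accuracy.columns = ['Mean Accuracy', 'Standard Deviation']
accuracy

| Model               | Mean Accuracy | Standard Deviation |
|---------------------|---------------|--------------------|
| Linear SVC          | 88.773%       | 0.368%             |
| Logistic Regression | 87.767%       | 0.433%             |
| Multinomial NB      | 85.720%       | 0.117%             |
| Random Forest       | 66.213%       | 1.411%             |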
Improvements
We will focus model improvement on the Linear SVC because it performed the best.
As a reminder, these were the three main considerations before going in.
Account for imbalances in number of occurrences per category.
Consider reducing number of categories.
Consider adding additional stopwords.
To get a better idea of how our model performed, we will plot a confusion matrix, which displays the total number of attempts our classification model made along with how many were accurately categorized.
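Show code
from sklearn.metrics import confusion_matrix  # needed alongside the earlier imports

# Recreating LinearSVC Model
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    features, labels, df.index, test_size=0.25, random_state=1
)
model = LinearSVC()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion Matrix Plot
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(conf_mat, annot=True, cmap="Greens", fmt='d',
            xticklabels=category_id_df.category.values,
            yticklabels=category_id_df.category.values)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix for LinearSVC \n", size=20)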
On the diagonal are the number of rows that the model correctly predicted for each category. The off-diagonal cells represent incorrect guesses, with each column showing complaints the model incorrectly assigned to that category. For example, the model correctly classified 1,484 complaints as "Electrical," incorrectly classified 2 as "Electrical" when they should have been classified as "COVID-19," and classified 10 as "Road Maintenance" when they should have been classified as "Electrical."
Modifications
Looking at the chart, the top three categories dominate, comprising 2,969 of the 3,750 rows in our test data. Most of the incorrect predictions appear in the columns for these three categories, meaning the model often classified complaints as them when it shouldn't have. Even though we chose a vectorizer specifically to down-weight the ubiquitous terms that drive those categories, our model still has a tendency to over-predict them.
A few categories have 12 or fewer test observations: Markets, Estate, OFC, Welfare Schemes, Advertisement, Education, Town Planning, Lakes, and Parks and Playgrounds. Merging these will likely improve accuracy considering how poorly our model did at predicting them, but there isn't a clear category to merge them with. Many of these complaints were falsely labeled as "Road Maintenance." While converting them to that category might lead to higher accuracy, it doesn't really make sense semantically, and it would likely hurt performance in the future.
We could reassign these categories to "Others," but that category performed abysmally, with only 3 of 43 complaints correctly predicted. Moving them there probably won't hurt, but it likely won't improve the "Others" results either.
Lakes and Advertisement, for which the model predicted quite a few complaints correctly, will be left untouched for now. The remaining categories with under 12 test observations will be merged into "Others."
Show code
# Read in Data | Copy and Paste from Above
df = pd.read_csv("bengaluru-grievances.csv")
pd.DataFrame(df.category.unique()).values

character_limit = (df["description"].str.len() > 12)
df = df.loc[character_limit]
df = df.sample(15000, random_state=2).copy()  # Select New Rows

# Convert Columns
df["category"] = df["category"].replace({'Markets': 'Others'})
df["category"] = df["category"].replace({'Estate': 'Others'})
df["category"] = df["category"].replace({'Welfare Schemes': 'Others'})
df["category"] = df["category"].replace({'Education': 'Others'})
df["category"] = df["category"].replace({'Town Planning': 'Others'})
"Optical Fiber Cables" and "Storm Water Drains," though, are directly related to "Road Maintenance," and looking at the chart, that's the category the model most often mistook them for. For these two, it makes more sense to convert them to "Road Maintenance" rather than "Others."
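Show code
df["category"] = df["category"].replace({'Optical Fiber Cables (OFC)': 'Road Maintenance(Engg)'})
df["category"] = df["category"].replace({'Storm Water Drain(SWD)': 'Road Maintenance(Engg)'})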
While we've brought the total number of categories down from 20 to 13, we've only changed 45 test rows. Even if the improved model correctly predicted all of these observations, we would only see an improvement of about 1.2%. That's not insignificant, but hardly substantial. Improving our stop word list, on the other hand, should improve the accuracy of the model overall.
Recall the most common unigrams and bigrams we found for each category earlier. Several of the most frequent terms had little or nothing to do with their respective grievances, and removing them should maintain or improve accuracy.
The original stop word list is sklearn's built-in English list, which already contains over 300 words. We want to keep those words while adding to them, so we take the union of the two sets and pass the result to TfidfVectorizer.
Show code
# Import Function
from sklearn.feature_extraction import text

# Add Stop Words
stop_words = text.ENGLISH_STOP_WORDS.union([
    "please", "plz", "look", "help", "causing", "coming", "kindly",
    "refused", "senior", "help", "one", "two", "three", "also",
    "77", "1", "2", "3", "since"
])

# TfidfVectorizer
tfidf = TfidfVectorizer(
    sublinear_tf=True,      # Set term frequency to logarithmic scale
    min_df=5,               # Remove terms that appear fewer than 'min_df' times
    ngram_range=(1, 2),     # Keep unigrams and bigrams
    stop_words=stop_words   # Use custom stop words
)
Redo Text Preprocessing
We have to redo the text preprocessing from earlier, so this is all copy-and-paste from before.
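Show code
# Copy and Paste
df['category_id'] = df["category"].factorize()[0]
category_id_df = df[["category", "category_id"]].drop_duplicates()
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[["category_id", "category"]].values)

features = tfidf.fit_transform(df.description).toarray()
labels = df.category_id

X = df["description"]
y = df["category"]
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    features, labels, df.index, test_size=0.25, random_state=2
)
model = LinearSVC()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

New Results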
We don’t need the complicated for loop from before because we only have one model this time. Therefore, we simply use cross_val_score as we did before and print the results to console.
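Show code
# CV = 5, as defined earlier
accuracy_svc = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
cv_mean_accuracy_svc = accuracy_svc.mean()
cv_mean_std_svc = accuracy_svc.std()
print(cv_mean_accuracy_svc * 100)
print(cv_mean_std_svc * 100)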
Our new Linear SVC model achieved 91.987% accuracy with a standard deviation of 0.282% across the five folds. That's an improvement of 3.214 percentage points, while also reducing the standard deviation of model performance by 0.086 percentage points.
Another way to look at these refinements: out of a possible 3,750 complaints, our model correctly classified an additional 121, going from 3,328 correct predictions to 3,449.
Conclusions
Once again, the top three categories performed about as well as before. The model correctly predicted an additional 63 road maintenance complaints this iteration. These three categories also continue to account for most of the incorrect predictions.
The "Others" category once again performed dreadfully, recording only two additional correct predictions despite having even more chances. Given it's a category meant to be all-encompassing, it probably makes sense to manually reclassify those complaints into new or existing categories.
Minor categories saw little or no improvement. "Health Dept" was the only category that performed worse, dropping from 69.7% to 54.6% accuracy. The model chose "Solid Waste" and "Road Maintenance" for these complaints much more often than the previous model did, though it's unclear why.
Increasing the stop word list seems to have improved accuracy overall. A more thorough list and additional adjustments might boost performance slightly more.
Source
You can check out the Python source code for the MLE model here: https://github.com/Nathan-States/Bengaluru-Text-Mining. Download the full dataset here: https://www.dolthub.com/repositories/bkowshik/bbmp-sahaaya.