This project explores sentiment analysis, a technique used to understand emotions expressed in text. We delve into the world of movie reviews, applying sentiment analysis techniques to uncover audience sentiment towards various films. This can provide valuable insights for filmmakers, studios, and moviegoers alike. For more analysis and artificial intelligence related content visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
2. INDEX
01 DATA COLLECTION
02 DATA PREPROCESSING
03 FEATURE EXTRACTION
04 MODEL SELECTION
05 MODEL TRAINING
06 MODEL EVALUATION
07 ERROR ANALYSIS
08 INTERPRETABILITY
09 CONCLUSION
10 FINE-TUNING AND ITERATION
3. Data Collection
For this project, I selected the IMDb movie review
dataset from Kaggle.
Link for dataset
5. DATA PREPROCESSING
• HTML tags removed
• All uppercase words converted to lowercase
• Non-alphanumeric characters removed
• Extra whitespace removed
• Tokenization done
• Stemming used instead of lemmatization because
lemmatization takes more run time
• Duplicate reviews removed
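The steps listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the project's actual code: the function names `preprocess`, `crude_stem`, and `drop_duplicates` are hypothetical, tokenization is assumed to be a plain whitespace split, and `crude_stem` is a deliberately simple stand-in for a real stemmer such as NLTK's `PorterStemmer`.

```python
import re


def crude_stem(token):
    # Hypothetical stand-in for a real stemmer (e.g. NLTK's PorterStemmer):
    # it only strips a few common English suffixes.
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(review):
    text = re.sub(r"<[^>]+>", " ", review)      # remove HTML tags
    text = text.lower()                         # convert to lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # remove non-alphanumeric characters
    text = re.sub(r"\s+", " ", text).strip()    # collapse extra whitespace
    tokens = text.split()                       # tokenize on whitespace (assumption)
    return [crude_stem(t) for t in tokens]      # stem each token


def drop_duplicates(reviews):
    # Remove exact duplicate reviews while keeping the first occurrence of each.
    return list(dict.fromkeys(reviews))


raw_reviews = [
    "<br />The ACTING was Great!!   Loved it.",
    "<br />The ACTING was Great!!   Loved it.",
    "Boring plot.",
]
unique_reviews = drop_duplicates(raw_reviews)
cleaned = [preprocess(r) for r in unique_reviews]
```

Stemming only trims suffixes by rule, which is why it runs faster than lemmatization; the trade-off is that stems such as "lov" are not always real words.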
6. FEATURE EXTRACTION
Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical
measure used in natural language processing (NLP) and information
retrieval to evaluate the importance of a word in a document relative to a
collection of documents, typically a corpus. TF-IDF combines two
components: Term Frequency (TF) and Inverse Document Frequency
(IDF).
TF-IDF is the feature extraction method used in this project.
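To make the two components concrete, here is a small pure-Python sketch of one common TF-IDF variant. It assumes TF is the term count divided by the document length and IDF is log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf` and the toy documents are hypothetical; library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed, normalized variant, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # TF (count / document length) times IDF (log of N / df)
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus of already-tokenized reviews (illustrative only).
docs = [
    ["great", "movie"],
    ["bad", "movie"],
    ["great", "great", "act"],
]
w = tf_idf(docs)
```

In this toy corpus "act" appears in only one document, so it receives a higher weight there than "great", which appears in two. A term that occurs in every document gets a weight of zero under this variant.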
9. MODEL EVALUATION
All models produced approximately the same results. The following metrics were used:
1. Accuracy: Proportion of correctly classified instances among all instances.
2. Precision: Proportion of true positive predictions among all positive
predictions.
3. Recall: Proportion of true positive predictions among all actual positive
instances.
4. F1-score: Harmonic mean of precision and recall, providing a balance
between them.
5. ROC-AUC: Area under the Receiver Operating Characteristic (ROC)
curve, measuring the model's ability to discriminate between positive and
negative instances.
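The definitions above translate directly into code. The sketch below computes them by hand for binary labels (1 = positive, 0 = negative) so the formulas are visible. The function names and the toy labels and scores are hypothetical and are not the project's results. ROC-AUC is computed here as the fraction of positive/negative pairs in which the positive instance receives the higher score, which is equivalent to the area under the ROC curve.

```python
def classification_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    # Probability that a random positive is scored above a random negative
    # (ties count as half). Requires at least one positive and one negative.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (illustrative values only).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy, precision, recall, and F1 depend on the hard predictions, while ROC-AUC depends only on how the scores rank the instances, so it can differ between models whose thresholded predictions look the same.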
10. ERROR ANALYSIS
All models produced approximately the same results. I tried to resolve this but did not find a solution.
When I converted X_train and X_test into arrays, the session crashed,
so I reduced the number of inputs and outputs used for training. This may be the
reason the results are not good.
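A likely explanation for the crash, offered here as an assumption rather than a diagnosis, is memory: TF-IDF features are usually stored as a sparse matrix, and converting that to a dense array allocates space for every zero. The rough estimate below uses hypothetical sizes (40,000 training reviews and a 100,000-term vocabulary); the project's real dimensions are not stated.

```python
def dense_size_gib(n_rows, n_features, bytes_per_value=8):
    # Memory needed for a dense array of 64-bit floats, in GiB.
    return n_rows * n_features * bytes_per_value / 1024**3


# Hypothetical dimensions, for illustration only.
size = dense_size_gib(40_000, 100_000)
print(f"Dense array would need about {size:.1f} GiB")
```

If this is the cause, two options may avoid shrinking the training set: cap the vocabulary size during feature extraction, or keep the matrix sparse, since many scikit-learn estimators (for example `LogisticRegression` and `MultinomialNB`) accept sparse input directly.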
17. Future expectations:
• Word2Vec can be used instead of TF-IDF
• More models can be used, such as CNN and Logistic Regression
• Lemmatization could be used instead of stemming