DATA SCIENCE FOR GROCERIES MARKET ANALYSIS, CLUSTERING, AND PREDICTION WITH PYTHON GUI

DATA SCIENCE FOR GROCERIES MARKET ANALYSIS, CLUSTERING, AND PREDICTION WITH PYTHON GUI PDF

Author: Vivian Siahaan

Publisher: BALIGE PUBLISHING

Published: 2022-05-03

Total Pages: 335

ISBN-13:

DOWNLOAD EBOOK →

The objective of this data science project is to analyze and predict customer behavior in the groceries market using Python and create a graphical user interface (GUI) using PyQt. The project encompasses various stages, starting from exploring the dataset and visualizing the distribution of features to RFM analysis, K-means clustering, predicting clusters with machine learning algorithms, and implementing a GUI for user interaction. The first step in this project involves exploring the dataset. We load the dataset containing information about customers' purchases in the groceries market and examine its structure. We check for missing values and perform data preprocessing if necessary, ensuring the dataset is ready for analysis. This initial exploration allows us to gain a better understanding of the data and its characteristics. Following the dataset exploration, we conduct exploratory data analysis (EDA). This step involves visualizing the distribution of different features within the dataset. By creating histograms, box plots, scatter plots, and other visualizations, we gain insights into the patterns, trends, and relationships within the data. EDA helps us identify outliers, understand feature distributions, and uncover potential correlations between variables. After the EDA phase, we move on to RFM analysis. RFM stands for Recency, Frequency, and Monetary analysis. In this step, we calculate three key metrics for each customer: recency (how recently a customer made a purchase), frequency (how often a customer made purchases), and monetary value (how much a customer spent). RFM analysis allows us to segment customers based on their purchasing behavior, identifying high-value customers and those who require re-engagement strategies. Once we have the clusters, we can utilize machine learning algorithms to predict the cluster for new or unseen customers. We train various models, including logistic regression, support vector machines, decision trees, k-nearest neighbors, random forests, gradient boosting, naive Bayes, adaboost, XGBoost, and LightGBM, on the clustered data. These models learn the patterns and relationships between customer features and their assigned clusters, enabling us to predict the cluster for new customers accurately. To evaluate the performance of our models, we utilize metrics such as accuracy, precision, recall, and F1-score. These metrics allow us to measure the models' predictive capabilities and compare their performance across different algorithms and preprocessing techniques. By assessing the models' performance, we can select the most suitable model for cluster prediction in the groceries market analysis. In addition to the analysis and prediction components, this project aims to provide a user-friendly interface for interaction and visualization. To achieve this, we implement a GUI using PyQt, a Python library for creating desktop applications. The GUI allows users to input new customer data and predict the corresponding cluster based on the trained models. It provides visualizations of the analysis results, including cluster distributions, confusion matrices, and decision boundaries. The GUI allows users to select different machine learning models and preprocessing techniques through radio buttons or dropdown menus. This flexibility empowers users to explore and compare the performance of various models, enabling them to choose the most suitable approach for their specific needs. The GUI's interactive nature enhances the usability of the project and promotes effective decision-making based on the analysis results. In conclusion, this project combines data science methodologies, including dataset exploration, visualization, RFM analysis, K-means clustering, predictive modeling, and GUI implementation, to provide insights into customer behavior and enable accurate cluster prediction in the groceries market. By leveraging these techniques, businesses can enhance their marketing strategies, improve customer targeting and retention, and ultimately drive growth and profitability in a competitive market landscape. The project's emphasis on user interaction and visualization through the GUI ensures that businesses can easily access and interpret the analysis results, making informed decisions based on data-driven insights.

THREE DATA SCIENCE PROJECTS FOR RFM ANALYSIS, K-MEANS CLUSTERING, AND MACHINE LEARNING BASED PREDICTION WITH PYTHON GUI

THREE DATA SCIENCE PROJECTS FOR RFM ANALYSIS, K-MEANS CLUSTERING, AND MACHINE LEARNING BASED PREDICTION WITH PYTHON GUI PDF

Author: Vivian Siahaan

Publisher: BALIGE PUBLISHING

Published: 2022-05-11

Total Pages: 627

ISBN-13:

DOWNLOAD EBOOK →

PROJECT 1: RFM ANALYSIS AND K-MEANS CLUSTERING: A CASE STUDY ANALYSIS, CLUSTERING, AND PREDICTION ON RETAIL STORE TRANSACTIONS WITH PYTHON GUI The dataset used in this project is the detailed data on sales of consumer goods obtained by ‘scanning’ the bar codes for individual products at electronic points of sale in a retail store. The dataset provides detailed information about quantities, characteristics and values of goods sold as well as their prices. The anonymized dataset includes 64.682 transactions of 5.242 SKU's sold to 22.625 customers during one year. Dataset Attributes are as follows: Date of Sales Transaction, Customer ID, Transaction ID, SKU Category ID, SKU ID, Quantity Sold, and Sales Amount (Unit price times quantity. For unit price, please divide Sales Amount by Quantity). This dataset can be analyzed with RFM analysis and can be clustered using K-Means algorithm. The machine learning models used in this project to predict clusters as target variable are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, LGBM, Gradient Boosting, XGB, and MLP. Finally, you will plot boundary decision, distribution of features, feature importance, cross validation score, and predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy. PROJECT 2: DATA SCIENCE FOR GROCERIES MARKET ANALYSIS, CLUSTERING, AND PREDICTION WITH PYTHON GUI RFM analysis used in this project can be used as a marketing technique used to quantitatively rank and group customers based on the recency, frequency and monetary total of their recent transactions to identify the best customers and perform targeted marketing campaigns. The idea is to segment customers based on when their last purchase was, how often they've purchased in the past, and how much they've spent overall. Clustering, in this case K-Means algorithm, used in this project can be used to place similar customers into mutually exclusive groups; these groups are known as “segments” while the act of grouping is known as segmentation. Segmentation allows businesses to identify the different types and preferences of customers/markets they serve. This is crucial information to have to develop highly effective marketing, product, and business strategies. The dataset in this project has 38765 rows of the purchase orders of people from the grocery stores. These orders can be analyzed with RFM analysis and can be clustered using K-Means algorithm. The machine learning models used in this project to predict clusters as target variable are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, LGBM, Gradient Boosting, XGB, and MLP. Finally, you will plot boundary decision, distribution of features, feature importance, cross validation score, and predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy. PROJECT 3: ONLINE RETAIL CLUSTERING AND PREDICTION USING MACHINE LEARNING WITH PYTHON GUI The dataset used in this project is a transnational dataset which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers. You will be using the online retail transnational dataset to build a RFM clustering and choose the best set of customers which the company should target. In this project, you will perform Cohort analysis and RFM analysis. You will also perform clustering using K-Means to get 5 clusters. The machine learning models used in this project to predict clusters as target variable are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, LGBM, Gradient Boosting, XGB, and MLP. Finally, you will plot boundary decision, distribution of features, feature importance, cross validation score, and predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy.

RFM ANALYSIS AND K-MEANS CLUSTERING: A CASE STUDY ANALYSIS, CLUSTERING, AND PREDICTION ON RETAIL STORE TRANSACTIONS WITH PYTHON GUI

RFM ANALYSIS AND K-MEANS CLUSTERING: A CASE STUDY ANALYSIS, CLUSTERING, AND PREDICTION ON RETAIL STORE TRANSACTIONS WITH PYTHON GUI PDF

Author: Vivian Siahaan

Publisher: BALIGE PUBLISHING

Published: 2023-07-07

Total Pages: 390

ISBN-13:

DOWNLOAD EBOOK →

In this case study, we will explore RFM (Recency, Frequency, Monetary) analysis and K-means clustering techniques for retail store transaction data. RFM analysis is a powerful method for understanding customer behavior by segmenting them based on their transaction history. K-means clustering is a popular unsupervised machine learning algorithm used for grouping similar data points. We will leverage these techniques to gain insights, perform customer segmentation, and make predictions on retail store transactions. The case study involves a retail store dataset that contains transaction records, including customer IDs, transaction dates, purchase amounts, and other relevant information. This dataset serves as the foundation for our RFM analysis and clustering. RFM analysis involves evaluating three key aspects of customer behavior: recency, frequency, and monetary value. Recency refers to the time since a customer's last transaction, frequency measures the number of transactions made by a customer, and monetary value represents the total amount spent by a customer. By analyzing these dimensions, we can segment customers into different groups based on their purchasing patterns. Before conducting RFM analysis, we need to preprocess and transform the raw transaction data. This includes cleaning the data, aggregating it at the customer level, and calculating the recency, frequency, and monetary metrics for each customer. These transformed RFM metrics will be used for segmentation and clustering. Using the RFM metrics, we can apply clustering algorithms such as K-means to group customers with similar behaviors together. K-means clustering aims to partition the data into a predefined number of clusters based on their feature similarities. By clustering customers, we can identify distinct groups with different purchasing behaviors and tailor marketing strategies accordingly. K-means is an iterative algorithm that assigns data points to clusters in a way that minimizes the within-cluster sum of squares. It starts by randomly initializing cluster centers and then iteratively updates them until convergence. The resulting clusters represent distinct customer segments based on their RFM metrics. To determine the optimal number of clusters for our K-means analysis, we can employ elbow method. This method help us identify the number of clusters that provide the best balance between intra-cluster similarity and inter-cluster dissimilarity. Once the K-means algorithm has assigned customers to clusters, we can analyze the characteristics of each cluster. This involves examining the RFM metrics and other relevant customer attributes within each cluster. By understanding the distinct behavior patterns of each cluster, we can tailor marketing strategies and make targeted business decisions. Visualizations play a crucial role in presenting the results of RFM analysis and K-means clustering. We can create various visual representations, such as scatter plots, bar charts, and heatmaps, to showcase the distribution of customers across clusters and the differences in RFM metrics between clusters. These visualizations provide intuitive insights into customer segmentation. The objective of this data science project is to analyze and predict customer behavior in the groceries market using Python and create a graphical user interface (GUI) using PyQt. The project encompasses various stages, starting from exploring the dataset and visualizing the distribution of features to RFM analysis, K-means clustering, predicting clusters with machine learning algorithms, and implementing a GUI for user interaction. Once we have the clusters, we can utilize machine learning algorithms to predict the cluster for new or unseen customers. We train various models, including logistic regression, support vector machines, decision trees, k-nearest neighbors, random forests, gradient boosting, naive Bayes, adaboost, XGBoost, and LightGBM, on the clustered data. These models learn the patterns and relationships between customer features and their assigned clusters, enabling us to predict the cluster for new customers accurately. To evaluate the performance of our models, we utilize metrics such as accuracy, precision, recall, and F1-score. These metrics allow us to measure the models' predictive capabilities and compare their performance across different algorithms and preprocessing techniques. By assessing the models' performance, we can select the most suitable model for cluster prediction in the groceries market analysis. In addition to the analysis and prediction components, this project aims to provide a user-friendly interface for interaction and visualization. To achieve this, we implement a GUI using PyQt, a Python library for creating desktop applications. The GUI allows users to input new customer data and predict the corresponding cluster based on the trained models. It provides visualizations of the analysis results, including cluster distributions, confusion matrices, and decision boundaries. The GUI allows users to select different machine learning models and preprocessing techniques through radio buttons or dropdown menus. This flexibility empowers users to explore and compare the performance of various models, enabling them to choose the most suitable approach for their specific needs. The GUI's interactive nature enhances the usability of the project and promotes effective decision-making based on the analysis results.

SUPERMARKET SALES ANALYSIS AND PREDICTION USING MACHINE LEARNING WITH PYTHON GUI

SUPERMARKET SALES ANALYSIS AND PREDICTION USING MACHINE LEARNING WITH PYTHON GUI PDF

Author: Vivian Siahaan

Publisher: BALIGE PUBLISHING

Published: 2022-04-15

Total Pages: 187

ISBN-13:

DOWNLOAD EBOOK →

The dataset used in this project consists of the growth of supermarkets with high market competitions in most populated cities. The dataset is one of the historical sales of supermarket company which has recorded in 3 different branches for 3 months data. Predictive data analytics methods are easy to apply with this dataset. Attribute information in the dataset are as follows: Invoice id: Computer generated sales slip invoice identification number; Branch: Branch of supercenter (3 branches are available identified by A, B and C); City: Location of supercenters; Customer type: Type of customers, recorded by Members for customers using member card and Normal for without member card; Gender: Gender type of customer; Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel; Unit price: Price of each product in $; Quantity: Number of products purchased by customer; Tax: 5% tax fee for customer buying; Total: Total price including tax; Date: Date of purchase (Record available from January 2019 to March 2019); Time: Purchase time (10am to 9pm); Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet); COGS: Cost of goods sold; Gross margin percentage: Gross margin percentage; Gross income: Gross income; and Rating: Customer stratification rating on their overall shopping experience (On a scale of 1 to 10). In this project, you will perform predicting rating using machine learning. The machine learning models used in this project to predict clusters as target variable are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, LGBM, Gradient Boosting, XGB, and MLP. Finally, you will plot boundary decision, distribution of features, feature importance, cross validation score, and predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy.

CUSTOMER SEGMENTATION, CLUSTERING, AND PREDICTION WITH PYTHON

CUSTOMER SEGMENTATION, CLUSTERING, AND PREDICTION WITH PYTHON PDF

Author: Vivian Siahaan

Publisher: BALIGE PUBLISHING

Published: 2023-07-04

Total Pages: 355

ISBN-13:

DOWNLOAD EBOOK →

In this book, we conducted a customer segmentation, clustering, and prediction analysis using Python. We began by exploring the customer dataset, examining its structure and contents. The dataset contained various features such as demographic, behavioral, and transactional attributes. To ensure accurate analysis and modeling, we performed data preprocessing steps. This involved handling missing values, removing duplicates, and addressing any data quality issues that could impact the results. We also split the dataset into features (X) and the target variable (y) for prediction tasks. Since the dataset had features with different scales and units, we applied feature scaling techniques. This process standardized or normalized the data, ensuring that all features contributed equally to the analysis. We then performed regression analysis on the "PURCHASESTRX" feature, which represents the number of purchase transactions made by customers. To begin the regression analysis, we first prepared the dataset by handling missing values, removing duplicates, and addressing any data quality issues. We then split the dataset into features (X) and the target variable (y), with "PURCHASESTRX" being the target variable for regression. We selected appropriate regression algorithms for modeling, such as Linear Regression, Random Forest, Naïve Bayes, KNN, Decision Trees, Support Vector, Ada Boost, Catboost, Gradient Boosting, Extreme Gradient Boosting, Light Gradient Boosting, and Multi-Layer Perceptron regressors. After training and evaluation, we analyzed the performance of the regression models. We examined the metrics to determine how accurately the models predicted the number of purchase transactions made by customers. A lower MAE and RMSE indicated better predictive performance, while a higher R2 score indicated a higher proportion of variance explained by the model. Based on the analysis, we provided insights and recommendations. These could include identifying factors that significantly influence the number of purchase transactions, understanding customer behavior patterns, or suggesting strategies to increase customer engagement and transaction frequency. Next, we focused on customer segmentation using unsupervised machine learning techniques. K-means clustering algorithm was employed to group customers into distinct segments. The optimal number of clusters was determined using KElbowVisualizer. To gain insights into the clusters, we visualized them 3D space. Dimensionality PCA reduction technique wasused to plot the clusters on scatter plots or 3D plots, enabling us to understand their separations and distributions. We then interpreted the segments by analyzing their characteristics. This involved identifying the unique features that differentiated one segment from another. We also pinpointed the key attributes or behaviors that contributed most to the formation of each segment. In addition to segmentation, we performed clusters prediction tasks using supervised machine learning techniques. Algorithms such as Logistic Regression, Random Forest, Naïve Bayes, KNN, Decision Trees, Support Vector, Ada Boost, Gradient Boosting, Extreme Gradient Boosting, Light Gradient Boosting, and Multi-Layer Perceptron Classifiers were chosen based on the specific problem. The models were trained on the training dataset and evaluated using the test dataset. To evaluate the performance of the prediction models, various metrics such as accuracy, precision, recall, F1-score, and ROC-AUC were utilized for classification tasks. Summarizing the findings and insights obtained from the analysis, we provided recommendations and actionable insights. These insights could be used for marketing strategies, product improvement, or customer retention initiatives.

ONLINE RETAIL CLUSTERING AND PREDICTION USING MACHINE LEARNING WITH PYTHON GUI

ONLINE RETAIL CLUSTERING AND PREDICTION USING MACHINE LEARNING WITH PYTHON GUI PDF

Author: Vivian Siahaan

Publisher: BALIGE PUBLISHING

Published: 2023-07-09

Total Pages: 302

ISBN-13:

DOWNLOAD EBOOK →

In this project, we embarked on a comprehensive journey of exploring the dataset and conducting analysis and predictions in the context of online retail. We began by examining the dataset and performing RFM (Recency, Frequency, Monetary Value) analysis, which allowed us to gain valuable insights into customer purchase behavior. Using the RFM analysis results, we applied K-means clustering, a popular unsupervised machine learning algorithm, to group customers into distinct clusters based on their RFM values. This clustering approach helped us identify different customer segments within the online retail dataset. After successfully clustering the customers, we proceeded to predict the clusters for new customer data. To achieve this, we trained various machine learning models, including logistic regression, support vector machines (SVM), K-nearest neighbors (KNN), decision trees, random forests, gradient boosting, naive Bayes, extreme gradient boosting, light gradient boosting, and multi-layer perceptron. These models were trained on the RFM features and the corresponding customer clusters. To evaluate the performance of the trained models, we employed a range of metrics such as accuracy, recall, precision, and F1 score. Additionally, we generated classification reports to gain a comprehensive understanding of the models' predictive capabilities. In order to provide a user-friendly and interactive experience, we developed a graphical user interface (GUI) using PyQt. The GUI allowed users to input customer information and obtain real-time predictions of the customer clusters using the trained machine learning models. This made it convenient for users to explore and analyze the clustering results. The GUI incorporated visualizations such as decision boundaries, which provided a clear representation of how the clusters were separated based on the RFM features. These visualizations enhanced the interpretation of the clustering results and facilitated better decision-making. To ensure the availability of the trained models for future use, we implemented model persistence by saving the trained models using the joblib library. This allowed us to load the models directly from the saved files without the need for retraining, thus saving time and resources. In addition to the real-time predictions, the GUI showcased performance evaluation metrics such as accuracy, recall, precision, and F1 score. This provided users with a comprehensive assessment of the model's performance and helped them gauge the reliability of the predictions. To delve deeper into the behavior and characteristics of the models, we conducted learning curve analysis, scalability analysis, and performance curve analysis. These analyses shed light on the models' learning capabilities, their performance with varying data sizes, and their overall effectiveness in making accurate predictions. The entire process from dataset exploration to RFM analysis, clustering, model training, GUI development, and real-time predictions was carried out seamlessly, leveraging the power of Python and its machine learning libraries. This approach allowed us to gain valuable insights into customer segmentation and predictive modeling in the online retail domain. By combining data analysis, clustering, machine learning, and GUI development, we were able to provide a comprehensive solution for online retail businesses seeking to understand their customers better and make data-driven decisions. The developed system offered an intuitive interface and accurate predictions, paving the way for enhanced customer segmentation and targeted marketing strategies. Overall, this project demonstrated the effectiveness of integrating machine learning techniques with graphical user interfaces to provide a user-friendly and interactive platform for analyzing and predicting customer clusters in the online retail industry.

5 FIVE DATA SCIENCE PROJECTS FOR ANALYSIS, CLASSIFICATION, PREDICTION, AND SENTIMENT ANALYSIS WITH PYTHON GUI

5 FIVE DATA SCIENCE PROJECTS FOR ANALYSIS, CLASSIFICATION, PREDICTION, AND SENTIMENT ANALYSIS WITH PYTHON GUI PDF

Author: Vivian Siahaan

Publisher: BALIGE PUBLISHING

Published: 2022-04-29

Total Pages: 979

ISBN-13:

DOWNLOAD EBOOK →

PROJECT 1: SUPERMARKET SALES ANALYSIS AND PREDICTION USING MACHINE LEARNING WITH PYTHON GUI The dataset used in this project consists of the growth of supermarkets with high market competitions in most populated cities. The dataset is one of the historical sales of supermarket company which has recorded in 3 different branches for 3 months data. Predictive data analytics methods are easy to apply with this dataset. Attribute information in the dataset are as follows: Invoice id: Computer generated sales slip invoice identification number; Branch: Branch of supercenter (3 branches are available identified by A, B and C); City: Location of supercenters; Customer type: Type of customers, recorded by Members for customers using member card and Normal for without member card; Gender: Gender type of customer; Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel; Unit price: Price of each product in $; Quantity: Number of products purchased by customer; Tax: 5% tax fee for customer buying; Total: Total price including tax; Date: Date of purchase (Record available from January 2019 to March 2019); Time: Purchase time (10am to 9pm); Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet); COGS: Cost of goods sold; Gross margin percentage: Gross margin percentage; Gross income: Gross income; and Rating: Customer stratification rating on their overall shopping experience (On a scale of 1 to 10). In this project, you will perform predicting rating using machine learning. The machine learning models used in this project to predict clusters as target variable are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, LGBM, Gradient Boosting, XGB, and MLP. Finally, you will plot boundary decision, distribution of features, feature importance, cross validation score, and predicted values versus true values, confusion matrix, learning curve, performance of the model, scalability of the model, training loss, and training accuracy. PROJECT 2: DETECTING CYBERBULLYING TWEETS USING MACHINE LEARNING AND DEEP LEARNING WITH PYTHON GUI As social media usage becomes increasingly prevalent in every age group, a vast majority of citizens rely on this essential medium for day-to-day communication. Social media’s ubiquity means that cyberbullying can effectively impact anyone at any time or anywhere, and the relative anonymity of the internet makes such personal attacks more difficult to stop than traditional bullying. On April 15th, 2020, UNICEF issued a warning in response to the increased risk of cyberbullying during the COVID-19 pandemic due to widespread school closures, increased screen time, and decreased face-to-face social interaction. The statistics of cyberbullying are outright alarming: 36.5% of middle and high school students have felt cyberbullied and 87% have observed cyberbullying, with effects ranging from decreased academic performance to depression to suicidal thoughts. In light of all of this, this dataset contains more than 47000 tweets labelled according to the class of cyberbullying: Age; Ethnicity; Gender; Religion; Other type of cyberbullying; and Not cyberbullying. The data has been balanced in order to contain ~8000 of each class. The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, LGBM classifier, Gradient Boosting, XGB classifier, LSTM, and CNN. Three feature scaling used in machine learning are raw, minmax scaler, and standard scaler. Finally, you will develop a GUI using PyQt5 to plot cross validation score, predicted values versus true values, confusion matrix, learning curve, decision boundaries, performance of the model, scalability of the model, training loss, and training accuracy. PROJECT 3: HIGHER EDUCATION STUDENT ACADEMIC PERFORMANCE ANALYSIS AND PREDICTION USING MACHINE LEARNING WITH PYTHON GUI The dataset used in this project was collected from the Faculty of Engineering and Faculty of Educational Sciences students in 2019. The purpose is to predict students' end-of-term performances using ML techniques. Attribute information in the dataset are as follows: Student ID; Student Age (1: 18-21, 2: 22-25, 3: above 26); Sex (1: female, 2: male); Graduated high-school type: (1: private, 2: state, 3: other); Scholarship type: (1: None, 2: 25%, 3: 50%, 4: 75%, 5: Full); Additional work: (1: Yes, 2: No); Regular artistic or sports activity: (1: Yes, 2: No); Do you have a partner: (1: Yes, 2: No); Total salary if available (1: USD 135-200, 2: USD 201-270, 3: USD 271-340, 4: USD 341-410, 5: above 410); Transportation to the university: (1: Bus, 2: Private car/taxi, 3: bicycle, 4: Other); Accommodation type in Cyprus: (1: rental, 2: dormitory, 3: with family, 4: Other); Mother's education: (1: primary school, 2: secondary school, 3: high school, 4: university, 5: MSc., 6: Ph.D.); Father's education: (1: primary school, 2: secondary school, 3: high school, 4: university, 5: MSc., 6: Ph.D.); Number of sisters/brothers (if available): (1: 1, 2:, 2, 3: 3, 4: 4, 5: 5 or above); Parental status: (1: married, 2: divorced, 3: died - one of them or both); Mother's occupation: (1: retired, 2: housewife, 3: government officer, 4: private sector employee, 5: self-employment, 6: other); Father's occupation: (1: retired, 2: government officer, 3: private sector employee, 4: self-employment, 5: other); Weekly study hours: (1: None, 2: <5 hours, 3: 6-10 hours, 4: 11-20 hours, 5: more than 20 hours); Reading frequency (non-scientific books/journals): (1: None, 2: Sometimes, 3: Often); Reading frequency (scientific books/journals): (1: None, 2: Sometimes, 3: Often); Attendance to the seminars/conferences related to the department: (1: Yes, 2: No); Impact of your projects/activities on your success: (1: positive, 2: negative, 3: neutral); Attendance to classes (1: always, 2: sometimes, 3: never); Preparation to midterm exams 1: (1: alone, 2: with friends, 3: not applicable); Preparation to midterm exams 2: (1: closest date to the exam, 2: regularly during the semester, 3: never); Taking notes in classes: (1: never, 2: sometimes, 3: always); Listening in classes: (1: never, 2: sometimes, 3: always); Discussion improves my interest and success in the course: (1: never, 2: sometimes, 3: always); Flip-classroom: (1: not useful, 2: useful, 3: not applicable); Cumulative grade point average in the last semester (/4.00): (1: <2.00, 2: 2.00-2.49, 3: 2.50-2.99, 4: 3.00-3.49, 5: above 3.49); Expected Cumulative grade point average in the graduation (/4.00): (1: <2.00, 2: 2.00-2.49, 3: 2.50-2.99, 4: 3.00-3.49, 5: above 3.49); Course ID; and OUTPUT: Grade (0: Fail, 1: DD, 2: DC, 3: CC, 4: CB, 5: BB, 6: BA, 7: AA). The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, LGBM classifier, Gradient Boosting, and XGB classifier. Three feature scaling used in machine learning are raw, minmax scaler, and standard scaler. Finally, you will develop a GUI using PyQt5 to plot cross validation score, predicted values versus true values, confusion matrix, learning curve, decision boundaries, performance of the model, scalability of the model, training loss, and training accuracy. PROJECT 4: COMPANY BANKRUPTCY ANALYSIS AND PREDICTION USING MACHINE LEARNING WITH PYTHON GUI The dataset was collected from the Taiwan Economic Journal for the years 1999 to 2009. Company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange. Attribute information in the dataset are as follows: Y - Bankrupt?: Class label; X1 - ROA(C) before interest and depreciation before interest: Return On Total Assets(C); X2 - ROA(A) before interest and % after tax: Return On Total Assets(A); X3 - ROA(B) before interest and depreciation after tax: Return On Total Assets(B); X4 - Operating Gross Margin: Gross Profit/Net Sales; X5 - Realized Sales Gross Margin: Realized Gross Profit/Net Sales; X6 - Operating Profit Rate: Operating Income/Net Sales; X7 - Pre-tax net Interest Rate: Pre-Tax Income/Net Sales; X8 - After-tax net Interest Rate: Net Income/Net Sales; X9 - Non-industry income and expenditure/revenue: Net Non-operating Income Ratio; X10 - Continuous interest rate (after tax): Net Income-Exclude Disposal Gain or Loss/Net Sales; X11 - Operating Expense Rate: Operating Expenses/Net Sales; X12 - Research and development expense rate: (Research and Development Expenses)/Net Sales X13 - Cash flow rate: Cash Flow from Operating/Current Liabilities; X14 - Interest-bearing debt interest rate: Interest-bearing Debt/Equity; X15 - Tax rate (A): Effective Tax Rate; X16 - Net Value Per Share (B): Book Value Per Share(B); X17 - Net Value Per Share (A): Book Value Per Share(A); X18 - Net Value Per Share (C): Book Value Per Share(C); X19 - Persistent EPS in the Last Four Seasons: EPS-Net Income; X20 - Cash Flow Per Share; X21 - Revenue Per Share (Yuan ¥): Sales Per Share; X22 - Operating Profit Per Share (Yuan ¥): Operating Income Per Share; X23 - Per Share Net profit before tax (Yuan ¥): Pretax Income Per Share; X24 - Realized Sales Gross Profit Growth Rate; X25 - Operating Profit Growth Rate: Operating Income Growth; X26 - After-tax Net Profit Growth Rate: Net Income Growth; X27 - Regular Net Profit Growth Rate: Continuing Operating Income after Tax Growth; X28 - Continuous Net Profit Growth Rate: Net Income-Excluding Disposal Gain or Loss Growth; X29 - Total Asset Growth Rate: Total Asset Growth; X30 - Net Value Growth Rate: Total Equity Growth; X31 - Total Asset Return Growth Rate Ratio: Return on Total Asset Growth; X32 - Cash Reinvestment %: Cash Reinvestment Ratio X33 - Current Ratio; X34 - Quick Ratio: Acid Test; X35 - Interest Expense Ratio: Interest Expenses/Total Revenue; X36 - Total debt/Total net worth: Total Liability/Equity Ratio; X37 - Debt ratio %: Liability/Total Assets; X38 - Net worth/Assets: Equity/Total Assets; X39 - Long-term fund suitability ratio (A): (Long-term Liability+Equity)/Fixed Assets; X40 - Borrowing dependency: Cost of Interest-bearing Debt; X41 - Contingent liabilities/Net worth: Contingent Liability/Equity; X42 - Operating profit/Paid-in capital: Operating Income/Capital; X43 - Net profit before tax/Paid-in capital: Pretax Income/Capital; X44 - Inventory and accounts receivable/Net value: (Inventory+Accounts Receivables)/Equity; X45 - Total Asset Turnover; X46 - Accounts Receivable Turnover; X47 - Average Collection Days: Days Receivable Outstanding; X48 - Inventory Turnover Rate (times); X49 - Fixed Assets Turnover Frequency; X50 - Net Worth Turnover Rate (times): Equity Turnover; X51 - Revenue per person: Sales Per Employee; X52 - Operating profit per person: Operation Income Per Employee; X53 - Allocation rate per person: Fixed Assets Per Employee; X54 - Working Capital to Total Assets; X55 - Quick Assets/Total Assets; X56 - Current Assets/Total Assets; X57 - Cash/Total Assets; X58 - Quick Assets/Current Liability; X59 - Cash/Current Liability; X60 - Current Liability to Assets; X61 - Operating Funds to Liability; X62 - Inventory/Working Capital; X63 - Inventory/Current Liability X64 - Current Liabilities/Liability; X65 - Working Capital/Equity; X66 - Current Liabilities/Equity; X67 - Long-term Liability to Current Assets; X68 - Retained Earnings to Total Assets; X69 - Total income/Total expense; X70 - Total expense/Assets; X71 - Current Asset Turnover Rate: Current Assets to Sales; X72 - Quick Asset Turnover Rate: Quick Assets to Sales; X73 - Working capitcal Turnover Rate: Working Capital to Sales; X74 - Cash Turnover Rate: Cash to Sales; X75 - Cash Flow to Sales; X76 - Fixed Assets to Assets; X77 - Current Liability to Liability; X78 - Current Liability to Equity; X79 - Equity to Long-term Liability; X80 - Cash Flow to Total Assets; X81 - Cash Flow to Liability; X82 - CFO to Assets; X83 - Cash Flow to Equity; X84 - Current Liability to Current Assets; X85 - Liability-Assets Flag: 1 if Total Liability exceeds Total Assets, 0 otherwise; X86 - Net Income to Total Assets; X87 - Total assets to GNP price; X88 - No-credit Interval; X89 - Gross Profit to Sales; X90 - Net Income to Stockholder's Equity; X91 - Liability to Equity; X92 - Degree of Financial Leverage (DFL); X93 - Interest Coverage Ratio (Interest expense to EBIT); X94 - Net Income Flag: 1 if Net Income is Negative for the last two years, 0 otherwise; and X95 - Equity to Liabilitys. The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, LGBM classifier, Gradient Boosting, and XGB classifier. Three feature scaling used in machine learning are raw, minmax scaler, and standard scaler. Finally, you will develop a GUI using PyQt5 to plot cross validation score, predicted values versus true values, confusion matrix, learning curve, decision boundaries, performance of the model, scalability of the model, training loss, and training accuracy. PROJECT 5: DATA SCIENCE FOR RAIN CLASSIFICATION AND PREDICTION WITH PYTHON GUI This dataset contains about 10 years of daily weather observations from many locations across Australia. RainTomorrow is the target variable to predict. You will determine rain or not in the next day. This column is Yes if the rain for that day was 1mm or more. Observations were drawn from numerous weather stations. The daily observations are available from http://www.bom.gov.au/climate/data. The dataset contains 23 attributes. Some of them are as follows: About some of them are: DATE - The date of observation; LOCATION - The common name of the location of the weather station; MINTEMP - The minimum temperature in degrees celsius; MAXTEMP - The maximum temperature in degrees celsius; RAINFALL - The amount of rainfall recorded for the day in mm; EVAPORATION - The so-called Class A pan evaporation (mm) in the 24 hours to 9am; SUNSHINE - The number of hours of bright sunshine in the day; WINDGUESTDIR - The direction of the strongest wind gust in the 24 hours to midnight; WINDGUESTSPEED- The speed (km/h) of the strongest wind gust in the 24 hours to midnight; and WINDDIR9AM - Direction of the wind at 9am. The models used in this project are K-Nearest Neighbor, Random Forest, Naive Bayes, Logistic Regression, Decision Tree, Support Vector Machine, Adaboost, LGBM classifier, Gradient Boosting, and XGB classifier. Three feature scaling used in machine learning are raw, minmax scaler, and standard scaler. Finally, you will develop a GUI using PyQt5 to plot cross validation score, predicted values versus true values, confusion matrix, learning curve, decision boundaries, performance of the model, scalability of the model, training loss, and training accuracy.

Python for Marketing Research and Analytics

Python for Marketing Research and Analytics PDF

Author: Jason S. Schwarz

Publisher: Springer Nature

Published: 2020-11-03

Total Pages: 272

ISBN-13: 3030497208

DOWNLOAD EBOOK →

This book provides an introduction to quantitative marketing with Python. The book presents a hands-on approach to using Python for real marketing questions, organized by key topic areas. Following the Python scientific computing movement toward reproducible research, the book presents all analyses in Colab notebooks, which integrate code, figures, tables, and annotation in a single file. The code notebooks for each chapter may be copied, adapted, and reused in one's own analyses. The book also introduces the usage of machine learning predictive models using the Python sklearn package in the context of marketing research. This book is designed for three groups of readers: experienced marketing researchers who wish to learn to program in Python, coming from tools and languages such as R, SAS, or SPSS; analysts or students who already program in Python and wish to learn about marketing applications; and undergraduate or graduate marketing students with little or no programming background. It presumes only an introductory level of familiarity with formal statistics and contains a minimum of mathematics.

Data Science and Machine Learning

Data Science and Machine Learning PDF

Author: Dirk P. Kroese

Publisher: CRC Press

Published: 2019-11-20

Total Pages: 538

ISBN-13: 1000730778

DOWNLOAD EBOOK →

Focuses on mathematical understanding Presentation is self-contained, accessible, and comprehensive Full color throughout Extensive list of exercises and worked-out examples Many concrete algorithms with actual code

Foundations of Data Science

Foundations of Data Science PDF

Author: Avrim Blum

Publisher: Cambridge University Press

Published: 2020-01-23

Total Pages: 433

ISBN-13: 1108617360

DOWNLOAD EBOOK →

This book provides an introduction to the mathematical and algorithmic foundations of data science, including machine learning, high-dimensional geometry, and analysis of large networks. Topics include the counterintuitive nature of data in high dimensions, important linear algebraic techniques such as singular value decomposition, the theory of random walks and Markov chains, the fundamentals of and important algorithms for machine learning, algorithms and analysis for clustering, probabilistic models for large networks, representation learning including topic modelling and non-negative matrix factorization, wavelets and compressed sensing. Important probabilistic techniques are developed including the law of large numbers, tail inequalities, analysis of random projections, generalization guarantees in machine learning, and moment methods for analysis of phase transitions in large random graphs. Additionally, important structural and complexity measures are discussed such as matrix norms and VC-dimension. This book is suitable for both undergraduate and graduate courses in the design and analysis of algorithms for data.