Employment management system for universities based on improved decision tree

Jing Li; Yongsheng Ma

doi:10.1515/jisys-2023-0138

Open Access Published by De Gruyter July 25, 2024

From the journal Journal of Intelligent Systems

https://doi.org/10.1515/jisys-2023-0138

Abstract

With the popularization of higher education, the number of students in colleges and universities keeps growing, and responding promptly to the problems students face in finding employment has become a major challenge for university teachers. Because traditional employment management of college graduates makes little use of student information, the quality of employment guidance services is low. To address this problem, this study proposes a simplified, improved Iterative Dichotomizer 3 (ID3) algorithm based on the correlation coefficient, which improves the information gain function and simplifies the information entropy formula. The experimental results show that the simplified improved ID3 based on correlation coefficients converges faster than the other two algorithms, starting to converge after only 17 iterations, and that its loss value, at around 0.12, is also smaller. Its minimum accuracy, precision, and recall for employment status prediction were 86.4%, 76.8%, and 72.8%, respectively, and its minimum F1-measure was 0.82, all higher than those of the other algorithms. Its time cost at a sample size of 80 is only 32 ms, which is lower than that of the other algorithms. The simplified and improved ID3 based on correlation coefficients can therefore predict graduates' employment status accurately and efficiently. The university employment management system proposed in the study makes efficient, in-depth use of graduate information through ID3, assisting university employment decision-makers and providing a reference for the employment guidance of university graduates.

Keywords: decision trees; data mining; employment management; iterative dichotomizer 3

1 Introduction

As the number of college graduates increases, the devaluation of academic qualifications has become inevitable, placing college graduates under heavy employment pressure. At the same time, under the influence of economic globalization, graduates' employment choices have become diversified and often blind, so most graduates lack a sound employment outlook when faced with massive amounts of employment information, which exacerbates employment difficulties [1,2]. To solve these problems, university teachers need to analyze various types of information about graduates in order to provide employment guidance. However, because the number of graduates is large and the amount of related information is even larger, existing data analysis technologies fall short when applied to data at this scale. Moreover, given the current employment characteristics of graduates, traditional manual management reveals various problems: manual entry of information makes employment guidance work inefficient and employment information inaccurate. Coupled with the low utilization of student information by traditional management methods, this has led to poor-quality career guidance services.

Data mining technology is widely used in various fields because it can realize the rapid mining of implicit information of large-scale data. In university management, data mining techniques have also started to be frequently applied, effectively alleviating the problems of difficult statistical queries and the heavy workload of information entry in the process of university management. However, most of them are used in the teaching research of educational administration management machines and the employment analysis of college graduates. Common data mining techniques include decision trees, neural networks, regression analysis, clustering, association rules, Bayesian classification, etc. Among them, decision tree algorithms are widely used in the processing of discrete data due to the advantages of easy extraction of rules, fast running speed, and the ability to handle missing data and irrelevant features. As a type of discrete data, various types of data of students in higher education institutions are suitable for processing using decision trees [3,4]. Common decision tree algorithms include Classifier 4.5 (C4.5), Classification and Regression Tree, and Iterative Dichotomizer 3 (ID3). The current management systems for the education field generally use a combination of data mining and text mining methods, but they have the disadvantage of slow learning speed, and when the text set size is large, the rule base will be very large. In addition, this method is highly sensitive to data and prone to overfitting, making it unsuitable for the management of employment information for graduates. Therefore, in order to improve the utilization of graduation information for college graduates and avoid overfitting problems, it is necessary to develop more suitable data mining methods and employment management systems.

As the originator of the decision tree algorithm, ID3 has a clear theory, a simple method, and strong learning ability, and it is widely used in classification, prediction, and rule extraction. However, the traditional ID3 easily falls into local optima and overfits, which leads to poor classification results. Therefore, the study proposes a simplified and improved ID3 algorithm based on a correlation coefficient, which addresses these shortcomings, applies it to the mining of university graduate data, and establishes an employment management system for colleges and universities. The system uses the improved ID3 algorithm to mine graduates' academic performance, competition records, and other personal data in order to determine how strongly each factor affects employment, achieve accurate graduate employment prediction, and provide strong support for the employment guidance of college graduates. The innovation of this study lies, first, in improving the information gain function of the ID3 algorithm and simplifying the information entropy formula. Second, a college graduate management system was established to analyze the employment situation of college graduates in terms of gender, major, and competitive ability, providing a reference for employment guidance work. The university employment management system proposed in this study achieves in-depth utilization of graduate information by improving the ID3 algorithm, providing assistance to university employment decision-makers and thus achieving the goal of increasing the graduation rate.

The article is divided into four parts. The first part briefly reviews the informatization of university management systems and applications of the ID3 algorithm; the second part presents the simplified and improved ID3 based on correlation coefficients; the third part analyzes the experimental results; and the fourth part summarizes the research.

2 Review of the literature

With the advancement of information technology, universities have gradually adopted information management, which has greatly reduced their management costs and relieved the pressure on management personnel; management systems for various aspects of university operations have also emerged. Xu and Liu proposed a university student results management system based on cloud storage technology to address the long response time and low accuracy of university results query systems. The system reduces the cost of data storage and improves security through cloud storage. According to their findings, the data storage time is only 0.5 s, the query response time is only 0.3 μs, and the accuracy rate is over 80% [5]. Fan et al. proposed an information management system based on data mining technology to predict students' learning behavior. The system uses association rules to mine implicit information in students' educational data and predict their likely course choices; in testing, it had a minimum support of 0.7 [6]. Muhamad and Darwesh proposed a library management system based on radio frequency identification (RFID) technology to improve the quality of, and satisfaction with, library services. It uses RFID to locate documents, which improves the speed and accuracy of document search, and to speed up the borrowing process. The results indicated that the system can quickly locate books that have not been returned to their proper place [7]. Huang et al. proposed a load prediction method based on long short-term memory (LSTM) for university public service management systems. The method predicts performance bottlenecks by mining the relationships between different modules. The results indicated that the method predicts load trends with high accuracy and is more efficient than the other methods compared [8]. Li proposed an intelligent campus management system based on Internet of Things (IoT) technology to address the management of smart campuses. The system uses IoT-enabled face recognition terminals as a unified data collection source, manages them from the system backend, and analyzes the collected data to obtain valuable campus big data. The results indicated that the system can effectively help teachers and students develop teaching and learning plans, with a user satisfaction score above 8 [9].

The ID3 algorithm is extensively applied in various fields because of its low computational complexity, its suitability for high-dimensional data, and the fact that building a decision tree classifier requires no domain knowledge or parameter settings. Wu et al. proposed an intelligent classification system based on the improved ID3 to address the difficulty distance education systems have in providing personalized instruction. The system classifies learners so that instruction can be personalized [10]. De Guzman et al. proposed path-planning algorithms based on exhaustive data-driven energy models and evolutionary algorithms to address the difficulty that traditional data-driven energy models and path-planning algorithms have in describing the motion trajectory of quadcopters. The experimental results show that the maximum difference in accuracy of the energy consumption model remains at 0.6% [11]. Harti et al. proposed a wave prediction method based on ID3 to predict wave patterns in sea areas. The method predicts the size and location of waves from historical data on the rise and fall of the sea surface and, according to the findings, classifies the sea surface with an accuracy of 88% [12]. Nurkholis et al. proposed a land analysis method based on the ID3 algorithm to analyze land use status. The method analyzes the sustainability of the land to determine its impact on agriculture. According to the findings, its accuracy in analyzing land use status is superior to that of other methods [13]. Pathak et al. put forward a data mining-based analysis method for the cost and supply chain management of small and medium-sized enterprises during COVID-19. The results show that the complexity of cost management, social and cultural impacts, and economic differences collectively hinder the development of small and medium-sized enterprises. In addition, the risk perception of small enterprises was found to be inaccurate, which led to ineffective cost management strategies and supply chain management during the COVID-19 epidemic [14].

As mentioned above, with the advent of the “Internet+” era, many universities have begun to move from traditional to information-based management, and numerous management systems with their own advantages now exist. However, the development of systems for graduate employment management has lagged behind. Compared with academic outcomes, the employment situation of graduates is difficult to predict accurately because many factors affect it. Most graduate employment management systems only record graduates' employment outcomes and do not make full use of their information. The ID3 algorithm is easy to understand and interpret and handles discrete data well, so it can be used to process graduates' past data. However, the traditional ID3 algorithm has inherent limitations that result in a low accuracy rate; therefore, the study proposes a simplified and improved ID3 algorithm based on correlation coefficients in order to predict graduates' employment status accurately. In addition, the research builds a university employment management system on the improved ID3 algorithm. Compared with traditional employment management systems, this system makes full use of the information of previous graduates, provides strong support for employment guidance work, and promotes the improvement of the employment rate and employment quality.

3 A university employment management system based on the enhanced ID3

Among data mining techniques, ID3 is widely used in various fields because of its complete search space, good robustness, and low sensitivity to noise. To fully understand the employment quality and employment rate of university students, the study proposes an employment management system based on ID3 to mine university students' employment data.

3.1 Research on improved decision tree ID3

ID3 is a classic decision tree algorithm that originated from concept learning systems. It selects test attributes according to the reduction in information entropy: at each node, the attribute with the highest information gain among those not yet used for classification becomes the splitting criterion. The process ends when the resulting decision tree classifies the training samples perfectly. Figure 1 illustrates the ID3 flow.

Figure 1

ID3 algorithm process.
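The flow in Figure 1 can be illustrated with a minimal recursive sketch of textbook ID3. It is not the authors' implementation: the function names, the toy records, and the stopping rule (stop when the best gain does not exceed the threshold, the usual convention) are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, features, threshold=0.0):
    """Return a nested-dict decision tree; leaves are class labels."""
    # All samples share one class: return that class as a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    # No features left to split on: fall back to the majority class.
    if not features:
        return majority
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    # Best gain does not exceed the threshold: stop splitting.
    if info_gain(rows, labels, best) <= threshold:
        return majority
    tree = {best: {}}
    remaining = [f for f in features if f != best]
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                                remaining, threshold)
    return tree

# Hypothetical toy records; the field names are illustrative only.
rows = [
    {"cadre": "yes", "cet": "cet6"},
    {"cadre": "yes", "cet": "cet4"},
    {"cadre": "no", "cet": "cet6"},
    {"cadre": "no", "cet": "cet4"},
]
labels = ["employed", "employed", "employed", "pending"]
tree = id3(rows, labels, ["cadre", "cet"])
print(tree)
```

The tree is a nested dictionary keyed first by the chosen attribute and then by its values, which keeps the extracted rules easy to read.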

As can be seen from Figure 1, the ID3 algorithm initializes the threshold and creates a single-node tree. If all samples belong to the same class, the node is labeled with that class and the decision tree is returned. Otherwise, the feature set is examined: if it is empty, the decision tree is returned; if not, the information gain of each feature is calculated. When the maximum information gain falls below the threshold, the decision tree is returned; otherwise, the samples are divided into subsets according to the values of the selected feature, and the procedure is repeated on each subset [15,16,17]. The formula for calculating the entropy of the training sample set is shown in the following equation:

(1) $H(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i.$

In equation (1), $s_i$ denotes the sample, $m$ denotes the number of categories of the sample, and $p_i$ denotes the probability that a sample belongs to the $i$th category. The entropy of an attribute of a training sample is calculated by the formula:

(2) $H(A = a_j) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij}, \quad H(A) = \sum_{j=1}^{v} p_j H(A = a_j), \quad p_j = \frac{d_j}{n}.$

In equation (2), $A$ indicates an attribute of the training sample set, $v$ is the number of values of the attribute, $d_j$ is the number of samples in the subset $A = a_j$, and $n$ is the total number of samples. The formula for calculating the information gain of the attribute $A$ is given in the following equation:

(3) $\mathrm{Gain}(A) = H(s_1, s_2, \ldots, s_m) - H(A).$

In equation (3), $\mathrm{Gain}(A)$ denotes the information gain of the attribute $A$, $H(s_1, s_2, \ldots, s_m)$ denotes the entropy of the sample set, and $H(A)$ denotes the information entropy of the attribute $A$. Although the ID3 algorithm searches quickly and produces a small number of nodes, it easily falls into local optimal solutions, is biased toward attributes with many values, and handles continuous data poorly. Therefore, an improved ID3 based on correlation coefficients is proposed. The correlation coefficient between discrete variables can be calculated using the tau-$y$ coefficient method, which is defined in the following equation:

(4) $E_1 = \sum \frac{(n - F_y) F_y}{n}, \quad E_2 = \sum \frac{(F_x - f) f}{F_x}, \quad \text{tau-}y = \frac{E_1 - E_2}{E_1}.$
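As a minimal sketch of equation (4), the following function computes tau-y from a contingency table of counts. The table layout (rows for values of x, columns for values of y), the function name, and the two toy tables are assumptions made for illustration; the sketch also assumes y takes at least two values, so that E1 is non-zero.

```python
def tau_y(table):
    """Goodman-Kruskal tau-y for a contingency table.

    table[i][j] is the count f of samples with the i-th value of x
    and the j-th value of y.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]        # F_x, marginal counts of x
    col_totals = [sum(col) for col in zip(*table)]  # F_y, marginal counts of y
    # E1: error in predicting y without knowing x.
    e1 = sum((n - fy) * fy / n for fy in col_totals)
    # E2: error in predicting y when x is known (empty rows are skipped).
    e2 = sum((fx - f) * f / fx
             for row, fx in zip(table, row_totals) if fx > 0
             for f in row)
    return (e1 - e2) / e1

perfect = tau_y([[10, 0], [0, 10]])     # x determines y completely
independent = tau_y([[5, 5], [5, 5]])   # x carries no information about y
print(perfect, independent)
```

The two toy tables mark the ends of the scale: knowing x removes all prediction error in the first and none in the second.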

In equation (4), $n$ indicates the number of samples, $f$ indicates the conditional (joint) frequency, $F_x$ indicates the marginal frequency of the variable $x$, $F_y$ stands for the marginal frequency of the variable $y$, $E_1$ indicates the error in predicting $y$ when the variable $x$ is not known, and $E_2$ indicates the error in predicting $y$ when $x$ is known. The improved formula for calculating the information gain is given in the following equation:

(5) $g(D, A) = \frac{1}{n} \left[ H(D) - (1 - \rho_{ay}) H(D \mid A) \right].$
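To see how equation (5) penalizes attributes with many values, the short sketch below evaluates it for two hypothetical attributes that share the same entropies and correlation coefficient but differ in their number of values. The function name and all numbers are illustrative assumptions, not values from the study.

```python
def improved_gain(h_d, h_d_given_a, rho, n_values):
    """Equation (5): (H(D) - (1 - rho) * H(D|A)) / n."""
    return (h_d - (1.0 - rho) * h_d_given_a) / n_values

# Same entropies and correlation; attribute B simply has twice as many values.
g_a = improved_gain(h_d=1.0, h_d_given_a=0.6, rho=0.2, n_values=3)
g_b = improved_gain(h_d=1.0, h_d_given_a=0.6, rho=0.2, n_values=6)
print(g_a, g_b)
```

Under these assumptions the attribute with six values receives half the score of the attribute with three, and a larger correlation coefficient raises the score by discounting the conditional entropy.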

In equation (5), $g(D, A)$ represents the improved information gain, $\rho_{ay}$ represents the correlation coefficient between the attribute $A$ and the category $Y$, and $n$ represents the number of values of $A$. Introducing the correlation coefficient effectively reduces the information gain of attributes that have many values but little relevance, which overcomes the multi-value bias of ID3. In addition, because evaluating the logarithm in the information entropy is comparatively expensive, the study simplifies it using Taylor's theorem and the Maclaurin formula [18,19,20]. Taylor's theorem is given in the following equation:

(6) $f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n + \frac{f^{(n+1)}(\zeta)}{(n+1)!}(x - x_0)^{n+1}.$

In equation (6), $\zeta$ lies between $x_0$ and $x$. Taking $x_0 = 0$ and $\zeta = \theta x$ gives the Maclaurin formula, shown in the following equation:

(7) $f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \cdots + \frac{f^{(n)}(0)}{n!}x^n + \frac{f^{(n+1)}(\theta x)}{(n+1)!}x^{n+1}, \quad 0 < \theta < 1.$

Dropping the remainder term in equation (7) gives the approximation in equation (8):

(8) $f(x) \approx f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \cdots + \frac{f^{(n)}(0)}{n!}x^n.$

Applying the same truncated expansion to $f(x) = \ln x$, taken about $x_0 = 1$ because $\ln x$ is undefined at $x = 0$, gives equation (9)

(9) $\ln x \approx (x - 1) - \frac{1}{2}(x - 1)^2 + \frac{1}{3}(x - 1)^3 + \cdots + (-1)^{n-1} \frac{1}{n}(x - 1)^n.$

When $x \in (0, 1)$, keeping only the first two terms of equation (9) gives equation (10)

(10) $\ln x \approx (x - 1) - \frac{1}{2}(x - 1)^2.$
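A quick numerical check, offered as a sketch rather than part of the study, shows how the two-term approximation in equation (10) behaves across the interval (0, 1). The function name and the sample points are arbitrary choices.

```python
import math

def ln_approx(x):
    """Second-order approximation of ln(x) about x = 1, as in equation (10)."""
    return (x - 1) - 0.5 * (x - 1) ** 2

for x in (0.9, 0.5, 0.1):
    exact = math.log(x)
    approx = ln_approx(x)
    print(f"x={x}: ln={exact:.4f}, approx={approx:.4f}, error={approx - exact:.4f}")
```

The approximation is tight near x = 1 and degrades as x approaches 0, so the speed gained by removing the logarithm comes with an accuracy cost for small probabilities.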

From equation (10), it can be seen that the calculation speed is improved by a series of simplifications that reduce logarithmic operations to non-logarithmic operations. The rewritten formula for calculating information entropy is shown in equation (11)

(11) $H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i = -\sum_{i=1}^{n} p_i \frac{\ln p_i}{\ln 2} = -\frac{1}{\ln 2} \sum_{i=1}^{n} p_i \ln p_i.$

In equation (11), $n$ represents the number of categories. Since $p_i \in (0, 1)$, substituting equation (10) into equation (11) yields equation (12)

(12) $H(X) \approx -\frac{1}{\ln 2} \sum_{i=1}^{n} p_i \left[ (p_i - 1) - \frac{1}{2}(p_i - 1)^2 \right] = \frac{1}{\ln 2} \sum_{i=1}^{n} p_i \left[ \frac{1}{2}(p_i - 1)^2 - (p_i - 1) \right].$
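The following sketch compares the exact entropy of equation (11) with the log-free form of equation (12) on two hypothetical class distributions. The function names and the distributions are assumptions for illustration; the sketch only compares values and does not claim that attribute rankings are always preserved.

```python
import math

def entropy_exact(probs):
    """Equation (11): base-2 Shannon entropy."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_simplified(probs):
    """Equation (12): (1 / ln 2) * sum of p * (0.5 * (p - 1)^2 - (p - 1))."""
    return sum(p * (0.5 * (p - 1) ** 2 - (p - 1)) for p in probs) / math.log(2)

for probs in ([0.5, 0.5], [0.9, 0.1]):
    print(probs, round(entropy_exact(probs), 4), round(entropy_simplified(probs), 4))
```

Because the two-term expansion never falls below ln x on (0, 1), the simplified value is at most the exact entropy, and the gap widens when some class probabilities are small.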

To further simplify the information entropy formula and its logarithmic operations, and to mitigate the multi-value bias of the original ID3 algorithm, assume that the attribute $A$ takes $n$ values, so that the set $D$ is partitioned into $n$ subsets by these values, and that each subset is further divided into $k$ subsets, where $k$ is the number of categories in $D$. The formula for calculating the information entropy in this case is given in equation (13)

(13) $H(D) = -\sum_{i=1}^{k} p'_i \log_2 p'_i, \quad H(D \mid A) = \sum_{i=1}^{n} \frac{|C_i|}{\sum_{m=1}^{n} |C_m|} \left( -\sum_{j=1}^{k} p''_j \log_2 p''_j \right), \quad p'_i = \frac{\sum_{j=1}^{n} |C_{ji}|}{\sum_{m=1}^{n} |C_m|}, \quad p''_j = \frac{|C_{ij}|}{|C_i|}.$

In equation (13), $C_m$ and $C_i$ denote the $m$th and $i$th subsets of the set $D$, respectively, and $C_{ij}$ denotes the $j$th subset of the set $C_i$. From equation (13), equation (14) is obtained

(14) $H(D) = \frac{1}{\ln 2} \sum_{i=1}^{k} p'_i \left[ \frac{1}{2}(p'_i - 1)^2 - (p'_i - 1) \right], \quad H(D \mid A) = \sum_{i=1}^{n} \frac{|C_i|}{\sum_{m=1}^{n} |C_m|} \cdot \frac{1}{\ln 2} \sum_{j=1}^{k} p''_j \left[ \frac{1}{2}(p''_j - 1)^2 - (p''_j - 1) \right].$

The information gain formula can be simplified by substituting equation (14) into equation (5), and the simplified improved information gain formula is shown in the following equation:

(15) $g(D, A) = \frac{1}{n} \left[ H(D) - (1 - |\rho_{ay}|) H(D \mid A) \right] = \frac{1}{n \ln 2} \sum_{i=1}^{k} p'_i \left[ \frac{1}{2}(p'_i - 1)^2 - (p'_i - 1) \right] - \frac{1}{n} (1 - |\rho_{ay}|) \sum_{i=1}^{n} \frac{|C_i|}{\sum_{m=1}^{n} |C_m|} \cdot \frac{1}{\ln 2} \sum_{j=1}^{k} p''_j \left[ \frac{1}{2}(p''_j - 1)^2 - (p''_j - 1) \right].$

The simplified formula for calculating the improved information gain simplifies the logarithmic operations in information entropy to non-logarithmic operations, effectively reducing the time complexity.
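Putting the pieces together, the sketch below evaluates the simplified improved information gain for one attribute from a table of counts. It is an illustration under stated assumptions, not the authors' code: the function names and the count-table layout are hypothetical, and the correlation coefficient is passed in by the caller (for example, from a tau-y computation) rather than derived here.

```python
import math

def simplified_entropy(probs):
    """Log-free entropy approximation from equation (12)."""
    return sum(p * (0.5 * (p - 1) ** 2 - (p - 1)) for p in probs) / math.log(2)

def simplified_improved_gain(table, rho):
    """Simplified improved information gain for a count table.

    table[i][j] = |C_ij|: samples with the i-th value of attribute A in class j.
    rho: correlation coefficient between A and the class, supplied by the caller.
    """
    n_values = len(table)                              # n, number of values of A
    subset_sizes = [sum(row) for row in table]         # |C_i|
    total = sum(subset_sizes)
    class_totals = [sum(col) for col in zip(*table)]   # per-class counts in D
    h_d = simplified_entropy([c / total for c in class_totals])
    h_d_given_a = sum(
        size / total * simplified_entropy([c / size for c in row])
        for row, size in zip(table, subset_sizes) if size > 0
    )
    return (h_d - (1 - abs(rho)) * h_d_given_a) / n_values

separating = simplified_improved_gain([[10, 0], [0, 10]], rho=0.0)
uninformative = simplified_improved_gain([[5, 5], [5, 5]], rho=0.0)
print(separating, uninformative)
```

With the correlation term switched off (rho = 0), an attribute that separates the classes perfectly keeps its full approximate entropy divided by its number of values, while an attribute that leaves the class mix unchanged scores zero.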

3.2 Design of the university employment management system

As the number of university graduates increases, the volume of student information grows, and it becomes difficult to exploit the potentially valuable information it contains to guide students' employment. To address this problem, the study proposes a university student employment analysis system based on the improved ID3 decision tree algorithm, which supports teachers' employment guidance work by predicting graduates' employment status from student learning data. Conceptual model design is the most important step in database design; here it is carried out with the E–R model shown in Figure 2.

Figure 2

E–R model.

The E–R model reflects the basic information of the company, the personal information of the students, and the effectiveness of the table structure, as well as the efficiency and results of data mining. Based on the E–R model, a database can be designed by combining various types of information about the graduates. As the ID3 algorithm is applicable to discrete data, continuous data need to be discretized first. The table structure of the basic information of graduates is shown in Table 1.

Table 1

Basic information of graduates

Field name	Type	Length (bytes)	Meaning
Stu_id	VARCHAR	12	Student ID
Stu_name	VARCHAR	20	Name
Stu_sex	VARCHAR	4	Gender
Stu_major	VARCHAR	20	Major
Stu_idcard	VARCHAR	20	ID number
Stu_politicalstatus	VARCHAR	10	Political outlook
Stu_address	VARCHAR	10	Origin
Stu_classleader	VARCHAR	2	Is it a class cadre

As can be seen from Table 1, the basic information database will provide statistics on the student’s name, place of origin, political affiliation, major, and whether he/she is a class officer. Table 2 illustrates the table structure for course and grade information.

Table 2

Course and grade information

Field name	Type	Length (bytes)	Meaning
c_no	VARCHAR	20	Course code
c_name	VARCHAR	30	Course name
c_type	VARCHAR	30	Course type
c_teacher	VARCHAR	20	Lecturer
c_year	VARCHAR	10	Opening academic year
id	INT	10	Number
Stu_id	VARCHAR	12	Student ID
Stu_name	VARCHAR	20	Name
c_score	VARCHAR	10	Course score

In Table 2, the database of courses and grades will provide a detailed record of graduates’ subject courses and grades. The structure of the student’s competition information table is illustrated in Table 3.

Table 3

Student competition information

Field name	Type	Length (bytes)	Meaning
id	INT	9	Number
Stu_id	VARCHAR	12	Student ID
Stu_name	VARCHAR	20	Name
co_name	VARCHAR	80	Competition name
co_type	VARCHAR	20	Competition type
g_level	VARCHAR	20	Award level
g_code	VARCHAR	5	Competition score
co_ability	VARCHAR	5	Competitive ability

As can be seen from Table 3, if a student participates in any competition, the database records data such as the competition information and the awards won. Before the employment information is processed, the employment data of previous years should first be analyzed and its important components mined. The specific process is as follows. The first step is to define the mining object and target; the second step is data preparation, i.e., collecting the various types of student information; the third step is data pre-processing, i.e., discretizing continuous data, removing or repairing dirty data, etc.; the fourth step is to establish the data mining model, i.e., to construct a prediction model of graduates' employment; the fifth step is to evaluate the classification rules, i.e., to analyze the prediction results; the last step is to apply the classification model. Data pre-processing specifically includes three parts: data integration, data cleaning, and data imputation, of which data integration brings all kinds of data together. The structure of the employment information summary table is illustrated in Table 4.

Table 4

Summary of employment information

Field name	Type	Length (bytes)	Meaning
Stu_id	VARCHAR	12	Student ID
Stu_name	VARCHAR	20	Name
Stu_sex	VARCHAR	4	Gender
Stu_major	VARCHAR	20	Major
Stu_idcard	VARCHAR	20	ID number
Stu_politicalstatus	VARCHAR	10	Political outlook
Stu_address	VARCHAR	10	Origin
Stu_classleader	VARCHAR	2	Is it a class cadre
Pro_course	VARCHAR	10	Professional course grades
Pub_course	VARCHAR	10	Public basic course score
Pra_course	VARCHAR	10	Practical course grades
Is_pass	VARCHAR	10	CET-4 and CET-6 pass status
co_ability	VARCHAR	5	Competitive ability
em_status	VARCHAR	10	Employment status
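As a sketch of how the summary structure in Table 4 might be created, the following uses Python's built-in `sqlite3` module with an in-memory database. The column names and lengths follow Table 4; the table name `employment_summary`, the choice of `Stu_id` as primary key, and the use of SQLite itself are assumptions, since the source does not name the database engine.

```python
import sqlite3

# Column names and lengths follow Table 4; the table name and the
# primary key are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE employment_summary (
    Stu_id              VARCHAR(12) PRIMARY KEY,
    Stu_name            VARCHAR(20),
    Stu_sex             VARCHAR(4),
    Stu_major           VARCHAR(20),
    Stu_idcard          VARCHAR(20),
    Stu_politicalstatus VARCHAR(10),
    Stu_address         VARCHAR(10),
    Stu_classleader     VARCHAR(2),
    Pro_course          VARCHAR(10),
    Pub_course          VARCHAR(10),
    Pra_course          VARCHAR(10),
    Is_pass             VARCHAR(10),
    co_ability          VARCHAR(5),
    em_status           VARCHAR(10)
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# PRAGMA table_info returns one row per column; the name is at index 1.
columns = [row[1] for row in conn.execute("PRAGMA table_info(employment_summary)")]
print(columns)
```

SQLite accepts the `VARCHAR(n)` declarations but does not enforce the lengths, so a production system on another engine would need its own checks.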

Data cleaning is the elimination of noise from valid data and the processing of missing data, as well as the removal of invalid data. Data normalization is mainly the removal of redundant data and the transformation of data with different attributes.
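The cleaning and discretization steps described above can be sketched as follows. The cut-off scores (85 and 70), the function names, and the sample records are illustrative assumptions; the source does not state the thresholds used to derive the "Excellent", "Good", and "Mediocre" bands.

```python
def grade_band(score, excellent=85, good=70):
    """Map a numeric score to a band; the cut-offs are illustrative assumptions."""
    if score >= excellent:
        return "Excellent"
    if score >= good:
        return "Good"
    return "Mediocre"

def preprocess(records, fields=("Pro_course", "Pub_course", "Pra_course")):
    """Drop duplicate or incomplete records, then discretize the score fields."""
    seen = set()
    cleaned = []
    for rec in records:
        # Data cleaning: skip repeated students and records with missing scores.
        if rec["Stu_id"] in seen or any(rec.get(f) is None for f in fields):
            continue
        seen.add(rec["Stu_id"])
        row = dict(rec)
        # Discretization: ID3 works on discrete values, so band each score.
        for f in fields:
            row[f] = grade_band(row[f])
        cleaned.append(row)
    return cleaned

# Hypothetical sample records.
sample = [
    {"Stu_id": "001", "Pro_course": 91, "Pub_course": 76, "Pra_course": 64},
    {"Stu_id": "001", "Pro_course": 91, "Pub_course": 76, "Pra_course": 64},
    {"Stu_id": "002", "Pro_course": None, "Pub_course": 80, "Pra_course": 70},
]
result = preprocess(sample)
print(result)
```

Only the first record survives in this example: the second is a duplicate and the third has a missing score.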

4 Results and analysis

To verify the performance of the employment management system based on the improved ID3, simulation experiments were conducted, and the proposed algorithm was compared with the improved ID3 based on attribute priority values and the improved ID3 based on correction functions. The data of the graduates of a university class served as the test set, and employment status was categorized into five types: not employed, further education/going abroad, large enterprises, small and medium enterprises, and state-owned enterprises. In the experiment, the adjustment coefficients for advanced mathematics, English, and professional courses were 0.4, 0.3, and 0.35, respectively. With these adjustment coefficients, the information gains for gender, major, class cadre status, professional course grades, basic course grades, practical course grades, competition ability, and passing of CET-4 and CET-6 were 0.0189, 0.0283, 0.0193, 0.0436, 0.0328, 0.0222, 0.0211, and 0.0281, respectively. Taking the postgraduate entrance examination as an example, the influence of different adjustment coefficients on the prediction accuracy is shown in Figure 3.

Figure 3

Effect of different adjustment coefficients on the prediction accuracy.

As shown in Figure 3, as the adjustment coefficient increases, the prediction accuracy first increases and then decreases. When the adjustment coefficients for advanced mathematics, English, and professional courses are 0.4, 0.3, and 0.35, respectively, the prediction accuracy is the highest, at 61.6, 60.2, and 65.9%, respectively. Table 5 illustrates the sample data statistics of the test set.

Table 5

Sample data statistics of the test set

Decision attributes	Generalization results	Large enterprises	Small- and medium-sized enterprise	State-owned enterprise	Continuing education/Going abroad	Pending employment
Gender	Female	79	96	16	104	7
Gender	Male	187	242	62	121	37
Major	Computer	98	168	34	109	16
	Software	82	21	12	53	10
	Network	53	83	32	52	14
	Internet of Things	25	16	5	7	1
Class cadre	Yes	70	46	14	56	4
Class cadre	No	191	294	11	55	4
Professional course grades	Excellent	76	88	28	115	12
	Good	125	141	32	77	22
	Mediocre	67	108	23	29	17
Public basic course score	Excellent	11	15	12	51	1
	Good	139	161	41	119	23
	Mediocre	109	162	33	53	23
Practical course grades	Excellent	78	86	30	98	13
	Good	166	218	43	111	24
	Mediocre	28	33	10	13	8
CET-4/CET-6 pass status	Failed	27	35	6	10	8
	Passed CET-4	193	251	57	135	29
	Passed CET-6	46	60	24	77	7
Competitive ability	Strong	39	45	13	49	5
	Moderate	35	21	11	32	3
	Weak	187	374	59	143	39
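As a worked illustration, the standard information gain of equations (1) to (3) can be computed for the Gender attribute directly from the counts in Table 5. This is a sketch only: it uses the unmodified gain, so its result need not match the values the study reports for its improved formula with adjustment coefficients.

```python
import math

# Rows of Table 5 for the Gender attribute. Column order: large enterprises,
# small- and medium-sized enterprises, state-owned enterprises,
# continuing education/going abroad, pending employment.
gender_counts = {
    "Female": [79, 96, 16, 104, 7],
    "Male": [187, 242, 62, 121, 37],
}

def entropy(counts):
    """Base-2 entropy of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

total = sum(sum(row) for row in gender_counts.values())
class_totals = [sum(col) for col in zip(*gender_counts.values())]
h_d = entropy(class_totals)                                   # entropy of the whole set
h_d_given_gender = sum(sum(row) / total * entropy(row)        # weighted entropy after the split
                       for row in gender_counts.values())
gain = h_d - h_d_given_gender
print(total, round(h_d, 4), round(gain, 4))
```

The same pattern applies to any other attribute block of the table by swapping in its rows.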

As can be seen from Table 5, the test set consisted of 951 records covering four majors. Course results were converted into three grades ("excellent," "good," and "mediocre"), CET-4/CET-6 status was divided into "failed," "passed CET-4," and "passed CET-6," and competitive ability was classified as "strong," "moderate," or "weak." This generalization effectively reduces the number of useless attribute values in the data. The convergence of the simplified and improved ID3 based on correlation coefficients, the improved ID3 based on attribute priority values, and the improved ID3 based on correction functions is shown in Figure 4.

Figure 4

Convergence of three improved ID3 algorithms.

As can be seen from Figure 4, the improved ID3 algorithm based on the correction function starts to converge after about 25 iterations, with a loss value of about 0.18; the improved ID3 algorithm based on the attribute priority value starts to converge after about 21 iterations, with a loss value of about 0.16; and the simplified improved ID3 based on the correlation coefficient starts to converge after about 17 iterations, with a loss value of about 0.12. These findings show that the simplified improved ID3 based on the correlation coefficient converges faster and reaches a smaller loss value. The prediction accuracy and misjudgment rates of the three improved ID3 algorithms for graduate employment are shown in Figure 5.

Figure 5

Prediction accuracy and misjudgment rate of different algorithms for graduates’ employment situation: (a) prediction accuracy of three improved ID3 algorithms and (b) error rate of three improved ID3 algorithms.

From Figure 5(a), the prediction accuracies of the improved ID3 algorithm based on the correction function for the five employment statuses (large enterprises, small and medium enterprises, state-owned enterprises, further studies/going abroad, and pending employment) are about 83.5, 82.1, 87.4, 85.2, and 81.6%, respectively; those of the improved ID3 based on the attribute priority value are about 85.1, 84.3, 87.7, 86.1, and 83.3%, respectively; and those of the simplified improved ID3 algorithm based on correlation coefficients are about 87.8, 86.5, 89.9, 88.7, and 86.4%, respectively. From Figure 5(b), the misjudgment rates of the improved ID3 based on the correction function for the five employment statuses are about 16.5, 17.9, 12.6, 14.8, and 18.4%, respectively; those of the improved ID3 based on the attribute priority values are about 14.9, 15.7, 12.3, 13.9, and 16.7%, respectively; and those of the simplified improved ID3 based on the correlation coefficient are about 12.2, 13.5, 9.1, 11.3, and 13.6%, respectively. The simplified and improved ID3 based on correlation coefficients has the highest prediction accuracy and the lowest misjudgment rate. The precision and recall of the three algorithms are shown in Figure 6.

Figure 6

Precision and recall of three algorithms: (a) precision of three improved ID3 algorithms and (b) recall rate of three improved ID3 algorithms.

From Figure 6(a), the precision of the improved ID3 algorithm based on the correction function for the five employment statuses is about 76.7, 74.8, 78.2, 75.6, and 74.5%, respectively; the precision of the improved ID3 based on the attribute priority value is about 77.2, 75.3, 78.8, 76.3, and 75.4%, respectively; and the precision of the simplified improved ID3 based on correlation coefficients is about 78.4, 76.8, 80.1, 78.2, and 77.3%, respectively. From Figure 6(b), the recall rates of the improved ID3 based on the correction function for the five employment statuses are about 71.3, 70.6, 72.5, 71.3, and 70.7%, respectively; the recall rates of the improved ID3 based on the attribute priority value are about 71.8, 71.2, 73.1, 71.9, and 71.45%, respectively; and the recall rates of the simplified improved ID3 based on the correlation coefficient are about 73.1, 72.8, 74.2, 73.5, and 73.3%, respectively. These results show that the precision and recall of the simplified improved ID3 based on the correlation coefficient are better than those of the other two algorithms. The F1-measure and time complexity of the three algorithms are shown in Figure 7.
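For readers who want to reproduce per-status precision and recall of this kind, the following is a minimal sketch of how the two metrics are commonly derived from a multi-class confusion matrix. The function name `per_class_precision_recall` and the three-class matrix are hypothetical illustrations, not the code or data used in the study.

```python
from typing import List, Tuple


def per_class_precision_recall(
    confusion: List[List[int]],
) -> Tuple[List[float], List[float]]:
    """Return (precision, recall), each with one entry per class.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    precision: List[float] = []
    recall: List[float] = []
    for k in range(n):
        true_positive = confusion[k][k]
        # Column sum: every sample predicted as class k.
        predicted_k = sum(confusion[i][k] for i in range(n))
        # Row sum: every sample that really belongs to class k.
        actual_k = sum(confusion[k])
        precision.append(true_positive / predicted_k if predicted_k else 0.0)
        recall.append(true_positive / actual_k if actual_k else 0.0)
    return precision, recall


# Hypothetical 3-class confusion matrix (not data from the study).
example = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
precision, recall = per_class_precision_recall(example)
```

In this sketch, precision for a class divides the correct predictions by the column total, while recall divides them by the row total, so the two values can differ for the same class.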

Figure 7

F1 measure and time complexity of three algorithms: (a) F1 measure of three algorithms and (b) time complexity of three improved ID3 algorithms.

From Figure 7(a), the F1-measure values of the improved ID3 based on the correction function for the five employment statuses are about 0.78, 0.81, 0.79, 0.83, and 0.82, respectively; those of the improved ID3 based on the attribute priority value are about 0.8, 0.82, 0.84, 0.82, and 0.85, respectively; and those of the simplified improved ID3 based on the correlation coefficient are about 0.82, 0.84, 0.86, 0.85, and 0.88, respectively. The simplified improved ID3 based on correlation coefficients thus has the highest F1-measure. From Figure 7(b), the time complexity of all three algorithms, reported here as running time in milliseconds, increases with the number of samples. When the number of samples is 80, the time complexity of the three algorithms is about 43, 38, and 32 ms, respectively, with the simplified improved ID3 based on the correlation coefficient being the lowest. The P–R curves and ROC curves of the three improved ID3 algorithms are shown in Figure 8.
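As a reminder of how the F1-measure combines precision and recall, the sketch below uses the standard harmonic-mean definition. The helper name `f1_measure` and the input values are illustrative assumptions; the study does not state how its per-status F1 values were averaged, so this is not presented as the authors' exact procedure.

```python
def f1_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in [0, 1])."""
    if precision + recall == 0:
        # Avoid division by zero when the classifier finds no true positives.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical inputs (not values from the study).
balanced = f1_measure(0.5, 0.5)
unbalanced = f1_measure(0.8, 0.6)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a classifier cannot reach a high F1-measure by maximizing only precision or only recall.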

Figure 8

P–R curves and ROC curves of three improved ID3 algorithms: (a) P–R curves of three algorithms and (b) ROC curves of three algorithms.

From Figure 8(a), the equilibrium point of the P–R curve of the improved ID3 based on the correction function is (0.75, 0.75); the equilibrium point of the P–R curve of the improved ID3 based on the attribute priority value is (0.77, 0.77); and the equilibrium point of the P–R curve of the simplified improved ID3 based on the correlation coefficient is (0.8, 0.8). From Figure 8(b), the areas under the ROC curves of the improved ID3 based on the correction function and on the attribute priority value are about 0.79 and 0.82, respectively, while the area under the ROC curve of the simplified improved ID3 based on the correlation coefficient is about 0.87. These results show that the simplified improved ID3 based on the correlation coefficient outperforms the other two algorithms.
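The equilibrium (break-even) point and the area under the ROC curve can be computed from ranked prediction scores. The sketch below is one common way to do so for a binary problem, assuming distinct scores and at least one sample of each class; the function names, labels, and scores are hypothetical and are not taken from the study.

```python
from typing import List, Tuple


def roc_points(labels: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points.

    Samples are ranked by descending score and the decision threshold is
    lowered one sample at a time. Assumes distinct scores and both classes present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points: List[Tuple[float, float]]) -> float:
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def break_even_point(labels: List[int], scores: List[float]) -> float:
    """Value at which precision equals recall on the P-R curve.

    When exactly as many samples are predicted positive as there are actual
    positives, precision and recall share numerator and denominator.
    """
    positives = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = sum(label for _, label in ranked[:positives])
    return tp / positives


# Hypothetical labels (1 = positive) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_trapezoid(roc_points(labels, scores))
bep = break_even_point(labels, scores)
```

A larger break-even value and a larger area under the ROC curve both indicate a better ranking of positive samples above negative ones, which is the sense in which the curves in Figure 8 are compared.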

5 Conclusion

In recent years, as the number of university graduates has continued to increase, graduates' employment problems have become more acute, which requires university teachers to provide proper employment guidance. However, the volume of student-related data is very large, so the useful information implied in the data is not fully utilized. To improve the quality and efficiency of employment guidance, the study proposes a university employment management system based on a simplified improved ID3 using the correlation coefficient, which can improve the quality of employment management by making full use of students' historical data. The results indicated that the simplified improved ID3 based on the correlation coefficient starts to converge after about 17 iterations, faster than the other two improved ID3 algorithms; its loss value is about 0.12 at this point, which is lower than that of the other algorithms. In the experiments on the prediction of employment status, the accuracy of the simplified improved ID3 based on the correlation coefficient for the different employment statuses was about 87.8, 86.5, 89.9, 88.7, and 86.4%, respectively; the precision was about 78.4, 76.8, 80.1, 78.2, and 77.3%, respectively; the recall was about 73.1, 72.8, 74.2, 73.5, and 73.3%, respectively; and the F1-measure was about 0.82, 0.84, 0.86, 0.85, and 0.88, respectively. All of these metrics were higher than those of the other algorithms. The misjudgment rates were about 12.2, 13.5, 9.1, 11.3, and 13.6%, respectively, which were lower than those of the other algorithms. The time complexity at a sample size of 80 and the area under the ROC curve of the simplified improved ID3 based on the correlation coefficient are 32 ms and 0.87, respectively, which shows that its time cost is small and its overall performance is good compared with the other algorithms.
The above results show that the simplified improved ID3 based on correlation coefficients can process the relevant data efficiently and accurately. Although it achieves more accurate employment prediction analysis, it still produces some errors, and its data collection remains limited by the need to protect students' privacy.

Acknowledgement

This study is supported by the Ministry of Education Industry-University Cooperative Education Project: Weifang College and Shandong Weifang Runfeng Chemical Co., Ltd. to build an employment practice base.

Funding information: Authors state no funding involved.
Author contributions: All authors have accepted responsibility for the entire content of this manuscript and consented to its submission to the journal, reviewed all the results and approved the final version of the manuscript. JL contributes to writing—original draft preparation, formal analysis, validation, software, visualization. YM contributes to writing—review and editing, methodology, data curation.
Conflict of interest: The authors declare that there is no conflict of interest in this article.
Data availability statement: All data generated or analysed during this study are included in this published article.

References

[1] Halper LR, Craft CA, Shi Y. Expanding the student employment literature: Investigating the practice of reflection in on-campus student employment. J Coll Stud Dev. 2020;61(4):516–21.10.1353/csd.2020.0045Search in Google Scholar

[2] Tan AW, Dwan CA, Ling TR, Thompson AJ, Peterson GM. Australian pharmacy student perceptions of employment in the pharmaceutical industry. J Pharm Pract Res. 2022;52(2):124–31.10.1002/jppr.1783Search in Google Scholar

[3] Li X, Hu Y, Xue B, Wang Y, Zhang Z, Li L, et al. State‐of‐health estimation for the lithium‐ion battery based on gradient boosting decision tree with autonomous selection of excellent features. Int J Energy Res. 2022;46(2):1756–65.10.1002/er.7292Search in Google Scholar

[4] Chen Y, He X, Xu J, Guo L, Lu Y, Zhang R. Decision tree-based classification in coastal area integrating polarimetric SAR and optical data. Data Technol Appl. 2022;56(3):342–57.10.1108/DTA-08-2019-0149Search in Google Scholar

[5] Xu M, Liu Y. Achievement management system for university students based on cloud storage technology. Int J Inf Commun Technol. 2022;20(1):18–33.10.1504/IJICT.2022.119312Search in Google Scholar

[6] Fan J, Zhang M, Sharma A, Kukkar A. Data mining applications in university information management system development. J Intell Syst. 2022;31(1):207–20.10.1515/jisys-2022-0006Search in Google Scholar

[7] Muhamad SS, Darwesh AM. Smart university library management system based on Internet of things. UHD J Sci Technol. 2020;4(2):63–74.10.21928/uhdjst.v4n2y2020.pp63-74Search in Google Scholar

[8] Huang L, Lee MY, Chen X, Tseng HW, Lee SF. Using microservice architecture as a load prediction strategy for management system of university public service. Sens Mater. 2021;33(2):805–14.10.18494/SAM.2021.3048Search in Google Scholar

[9] Li W. Design of smart campus management system based on internet of things technology. J Intell Fuzzy Syst. 2021;40(2):3159–68.10.3233/JIFS-189354Search in Google Scholar

[10] Wu Y, Zhang H, Li X. Improved ID3 algorithm based on intelligent computer distance education. Int J Electr Eng Telecommun Eng Intell Syst. 2020;28(4):223–7.Search in Google Scholar

[11] De Guzman CJP, Chua AY, Chu TS, Secco EL. Evolutionary algorithm-based energy-aware path planning with a quadrotor for warehouse inventory management. HighTech Innov J. 2023;4(4):829–37.10.28991/HIJ-2023-04-04-012Search in Google Scholar

[12] Harti AB, Christianto DH, Nabillah R, Oktavia M. Prediction of extreme sea water waves at ancol beach using ID3 algoritma algorithm. J Intell Decis Support Syst (IDSS). 2022;5(2):64–72.Search in Google Scholar

[13] Nurkholis A, Muhaqiqin M, Susanto T. Analisis kesesuaian lahan padi gogo berbasis sifat tanah dan cuaca menggunakan ID3 spasial (Land suitability analysis for upland rice based on soil and weather characteristics using spatial ID3). JUITA: J Inform. 2020;8(2):235–44.10.30595/juita.v8i2.8311Search in Google Scholar

[14] Pathak S, Swatdikun T, Hao Y. Cost management and supply chain management: Experiences of vulnerable SMEs during COVID-19. Emerg Sci J. 2023;7(6):2165–82.10.28991/ESJ-2023-07-06-018Search in Google Scholar

[15] Hafidh F, Kurniawan MY, Anwar RIY. Identifikasi ketunaan anak berkebutuhan khusus dengan algoritma iterative dichotomiser 3 (id3). J Buana Inform. 2021;12(2):78–87.10.24002/jbi.v12i2.4488Search in Google Scholar

[16] Guo Y, Mustafaoglu Z, Koundal D. Spam detection using bidirectional transformers and machine learning classifier algorithms. J Comput Cognit Eng. 2023;2(1):5–9.10.47852/bonviewJCCE2202192Search in Google Scholar

[17] Sunaryanto H, Hasan MA, Guntoro G. Classification analysis of unilak informatics engineering students using support vector machine (SVM), Iterative Dichotomiser 3 (ID3), random forest and k-nearest neighbors (KNN). IT J Res Dev. 2022;7(1):48–55.10.25299/itjrd.2022.8912Search in Google Scholar

[18] Mahmood T, Ali Z. Analysis of Maclaurin symmetric mean operators for managing complex interval-valued q-Rung orthopair fuzzy setting and their applications. J Comput Cognit Eng. 2023;2(2):98–115.10.47852/bonviewJCCE2202164Search in Google Scholar

[19] Omar MA. Performance evaluation of supervised machine learning classifiers for mapping natural language text to entity relationship models. J Pure Appl Sci. 2021;20(1):6–10.10.51984/jopas.v20i1.945Search in Google Scholar

[20] Omar M, Alsheky A, Faiz B. Novel rules for extracting the entities of entity relationship models. J Pure Appl Sci. 2021;20(2):29–35.10.51984/jopas.v20i2.1329Search in Google Scholar

Received: 2023-08-22

Accepted: 2024-04-17

Published Online: 2024-07-25

This work is licensed under the Creative Commons Attribution 4.0 International License.
