The Modern Freelancer and Student's Workbook
Enter your email below to receive my Freelance Tips, free courses, research papers, Test banks, Books, journals and more.

INSTITUTION

COURSE

LECTURER

NAME

DATE

Introduction

1a. Initial Data Exploration

Attribute types

First way:

Attribute Name: BATHRM

Attribute Type: Nominal

Justification: This attribute is nominal because its values cannot be ordered; they represent discrete units and label variables that have no quantitative value. The values of a nominal attribute are symbols or names of things, and each value represents some kind of category, code, or state. For this reason, nominal attributes are also referred to as categorical attributes. For example, the attribute marital status can take on the values single, married, divorced, and widowed.

Because nominal values have no meaningful order and are not quantitative, it makes no sense to find the mean (average) or median (middle) value of such an attribute. However, we can find its most commonly occurring value (the mode).

A binary attribute is a special nominal attribute with only two states: 0 or 1. A binary attribute is symmetric if both of its states are equally valuable and carry the same weight, for example the attribute gender with the states male and female. It is asymmetric if the outcomes of the states are not equally important, for example the positive and negative outcomes of a medical test for HIV. By convention, the most important outcome, which is usually the rarest one, is coded as 1 (e.g., HIV positive) and the other as 0 (e.g., HIV negative).

The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)

Attribute Name: HF-BATHRM

Attribute Type: Interval

Justification: An interval attribute holds numeric values that can be measured and ranked on a scale of equal-sized units. Because the data is quantitative, its central tendency can be described by the mean, median, and mode, and the differences between values are meaningful. Researchers should examine interval data before drawing any conclusions from the dataset they have been given, since it shows how values progress from one point to another and how they are related. It can be used in constructing bar charts, pie charts, and other statistical diagrams. Interval attributes provide enough information to order objects and to measure the differences between them (+, −).

Attribute Name: HEAT

Attribute Type: Ordinal

Justification: HEAT is classified as an ordinal attribute because its values can be ranked, for example as small, medium, and large. Its values are discrete categories rather than continuous measurements. Because the categories are ordered, the mode and the median are meaningful measures of central tendency, but the mean is not, since the distances between successive categories are unknown. Ordinal attributes are among the main attributes to examine when working with qualitative data.

Attribute Name: HEAT_D

Attribute Type: Ratio

Justification: A ratio attribute is a numeric attribute with an inherent zero-point. For example, a word count attribute is a ratio attribute. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The central tendency of a ratio attribute can be represented by its mode, its median (the middle value in an ordered sequence), and its mean.

For ratio variables, both differences and ratios are meaningful. (*, /)

Data Collection

Comparison of the attributes

Together, nominal and ordinal attributes are referred to as categorical or qualitative attributes. Qualitative attributes, such as employee ID, lack most of the properties of numbers. Even though they may be represented by integers, they should be treated more like symbols; for an attribute like employee ID, the mean or median of the values has little significance. Interval and ratio attributes are together referred to as quantitative or numeric attributes. Quantitative attributes are represented by numbers and have most of the properties of numbers, so operations such as the mean and standard deviation are meaningful. Note that quantitative attributes can be integer-valued or continuous.

A discrete attribute has only a finite or countably infinite set of values. Examples include zip codes, occupations, or the set of words found in a collection of documents. Discrete attributes are often represented as integer variables. Note that binary attributes are a special case of discrete attributes. Asymmetric binary attributes are those for which only non-zero values are considered important.

Real numbers are used to represent continuous attributes such as temperature, height, and weight. In practice, real values can only be measured and represented with a finite number of digits. Continuous attributes are typically represented as floating-point variables.

Second way:

| Attribute Name | Attribute Type | Justification |
|---|---|---|
| HEAT | Ordinal | represents discrete and ordered units |
| HEAT-D | Ratio | a numeric attribute with an inherent zero-point |
| HF-BATHRM | Interval | represents units that have numeric values |
| BATHRM | Nominal | represents discrete units |

1. The summarizing properties for the attributes

## Frequency & Distribution Data Visualisations

| Statistics | Value |
|---|---|
| Mean | 3.5 |
| Median | 4 |
| Minimum Value | 1.2 |
| Maximum Value | 4.8 |
| Standard deviation | 3.3 |
| Variance | 1.687 |

The pie chart that follows shows the binary values (0 and 1) of the Quotation Flag attribute. According to the literature, the Quotation Flag indicates whether or not a customer has purchased a policy. Applying that reasoning to the data, “0” refers to customers who have not bought the policy, while “1” refers to customers who have. Based on the row count, 18.43% of customers had purchased a policy, whereas 81.57% had not. The frequency of “0” is more than four times that of “1,” with 2447 rows compared to 553, as shown by the frequency graph (see figure 2).
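The percentages quoted above follow directly from the two row counts. As a quick check, the arithmetic can be sketched in Python; only the counts 2447 and 553 come from the text, and the variable names are illustrative.

```python
# Row counts reported in the text for the Quotation Flag attribute.
counts = {"0": 2447, "1": 553}

total = sum(counts.values())

# Share of each flag value as a percentage of all rows, rounded to 2 decimals.
shares = {flag: round(100 * n / total, 2) for flag, n in counts.items()}

# How many times more frequent "0" is than "1".
ratio = counts["0"] / counts["1"]

print(total)            # 3000
print(shares)           # {'0': 81.57, '1': 18.43}
print(round(ratio, 2))  # 4.42
```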

## 3 exploration

By applying hierarchical clustering and computing the year difference to the present, this scatter plot shows that the majority of the personal info3 nominal data was submitted six and seven years ago. The most recent records in this dataset were collected five years ago, although less personal info3 nominal data was recorded then than six and seven years ago. The earliest personal info3 nominal data was collected eight years ago. This shows not only a drop in the recording of personal info3 nominal data but also a decline in Quote IDs, which indicates that fewer quotes were issued five years ago.

The figure that follows gives the correlation between the two sets of data provided. The two sets are correlated because one must be obtained before the other: one is used in the calculation of the other. Both are among the sets of data collected in the field during research.

The rank correlation and linear correlation chart below identifies the correlations between the different attributes that have been used in the discussions within this report.
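To make the difference between the two measures concrete, here is a minimal Python sketch, assuming linear correlation means the Pearson coefficient and rank correlation means the Spearman coefficient (the Pearson coefficient computed on ranks). The function names and the toy data are hypothetical and are not taken from the report's dataset.

```python
from math import sqrt

def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Rank (Spearman) correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

linear_r = pearson(x, y)
rank_r = spearman(x, y)
print(round(linear_r, 3))  # high, but below 1 because the relation is curved
print(round(rank_r, 3))    # 1.0, because the ordering is perfectly preserved
```

A rank correlation that is noticeably higher than the linear one, as in this toy example, suggests a monotonic but non-linear relationship between two attributes.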

Code system and test coding

Using a combination of deductive and inductive coding (also called “hybrid” coding, cf. Fereday & Muir-Cochrane 2006), the analysis was conducted stage by stage. The coding system, together with the categories and themes created during the coding process, was developed gradually and collaboratively. The concepts to which the final codes, categories, and themes were connected (the aforementioned benefit categories used, operationalized, and expanded in the BeLL study) were established from the beginning and developed on theoretical premises (cf. overall final report).

This code system was built stage by stage through collaboration in several (virtual or in-person) workshops, whose objectives were to assess the usefulness, acceptability, and appropriateness of the code system. The concepts and codes created earlier for the BeLL survey, and through qualitative content analysis of the survey’s open-ended questions, served as the starting point. The leader of WP5 first asked all national teams to conduct a first test coding of a British team interview using the initial code system. She gathered all test codings, discussed the results in a small working group, and identified a number of issues to be discussed by all national teams (concerning certain codes or categories, for instance, or suggestions for the inclusion of new codes or the renaming of codes). The code system was modified and supplemented based on this information, and recommendations for a second test coding were developed.

## Data processing

1. Binning techniques

## Equi-width binning.

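• Equi-depth binning: follow the same procedure as for equi-width binning, but choose the bins so that each holds approximately the same number of values rather than spanning an interval of equal width.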
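The two binning techniques can be contrasted with a minimal Python sketch. It is an illustration of the general idea, not the workflow used in the report; the function names and the sample values are hypothetical, and the equi-width function assumes the values are not all identical.

```python
def equi_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    # The maximum value sits on the upper edge of the last interval, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equi_depth_bins(values, k):
    """Assign each value to one of k bins holding about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for position, index in enumerate(order):
        # The position in sorted order decides the bin, not the value itself.
        bins[index] = position * k // len(values)
    return bins

# Hypothetical sample data with one large value, not taken from the report's dataset.
data = [1, 2, 3, 4, 5, 6, 7, 8, 30]

print(equi_width_bins(data, 3))  # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equi_depth_bins(data, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With skewed data such as this, equi-width binning leaves the middle bin empty, while equi-depth binning keeps three values in every bin.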

## 2. Normalization

Min-max normalization

The purpose of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information.

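z-score normalization: performed in the same way as min-max normalization, except that each value is rescaled by subtracting the attribute's mean and dividing by its standard deviation.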
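Both normalizations can be sketched in a few lines of Python. This is a minimal illustration assuming the usual definitions: min-max maps the smallest value to a new minimum and the largest to a new maximum, and z-score subtracts the mean and divides by the standard deviation (the population standard deviation is an assumption here). The function names and sample data are hypothetical.

```python
from statistics import mean, pstdev

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumes the values are not all identical
    return [(v - mu) / sigma for v in values]

# Hypothetical sample data with mean 5 and population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]

scaled = min_max_normalize(data)
standardized = z_score_normalize(data)

print(scaled[0], scaled[-1])  # 0.0 1.0
print(standardized)           # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Min-max output is bounded by the chosen range, whereas z-scores are unbounded and express each value as a number of standard deviations from the mean.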

## 3. Discretization

Discretization is the process of converting a continuous attribute into an ordinal attribute by mapping a large number of values into a small number of categories. In this case, the Coverage_Info1 attribute is to be discretized into the following categories: Basic, Low, Medium and High.
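As a rough illustration of the mapping, the following Python sketch assigns a numeric value to one of the four categories. The cut points 5, 12 and 19 are hypothetical, since the boundaries actually used are not stated here; only the four category names come from the text.

```python
from bisect import bisect_right

# Hypothetical cut points: the boundaries used in the report are not stated.
CUT_POINTS = [5, 12, 19]
LABELS = ["Basic", "Low", "Medium", "High"]

def discretize(value):
    """Map a numeric value to an ordered category label.

    A value equal to a cut point falls into the higher category, so the
    categories are: value < 5, 5 <= value < 12, 12 <= value < 19, value >= 19.
    """
    return LABELS[bisect_right(CUT_POINTS, value)]

# Hypothetical sample values covering each category and each boundary.
sample = [-1, 4, 5, 11, 12, 18, 19, 25]
categories = [discretize(v) for v in sample]
print(categories)
# ['Basic', 'Basic', 'Low', 'Low', 'Medium', 'Medium', 'High', 'High']
```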

## 4. Binarisation

The purpose of binarisation is to map a continuous or categorical attribute into one or multiple binary variables. To binarise the Geographic_Info5 attribute into variables with the values “0” or “1”, the following steps were followed in KNIME:
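Outside KNIME, the same idea can be sketched in Python. The example below is a minimal one-hot encoding, assuming one binary variable per distinct category; it is not the KNIME workflow, and the function name, the `is_` column prefix, and the sample category codes are all hypothetical.

```python
def binarise(values):
    """One-hot encode a categorical attribute: one 0/1 column per category."""
    categories = sorted(set(values))
    return {
        f"is_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }

# Hypothetical category codes, used only to illustrate the encoding.
geographic_info5 = ["CA", "NJ", "CA", "TX"]

encoded = binarise(geographic_info5)
for column, bits in encoded.items():
    print(column, bits)
# is_CA [1, 0, 1, 0]
# is_NJ [0, 1, 0, 0]
# is_TX [0, 0, 0, 1]
```

Each row has exactly one "1" across the generated columns, so the original category can always be recovered from the binary variables.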

## 1C. Summary

The most important findings of this report include the following:

• Coverage info1 attribute: the data spans from -1 to 25; however, only 4 out of 3000 instances have the value “-1,” which makes it less frequent than every other value and leads to a very unbalanced distribution of data.
• Unanswered and confidential responses could potentially explain the 45.87% of blank or missing values in the personal information attributes. This could also suggest that “-1” indicates a data collection error, making it an anomaly. If the missing values need to be recovered, this is worth investigating further.
• By applying hierarchical clustering and calculating the year difference to the present, the scatter plot shows that most of the sales info5 data, which might indicate monetary worth, was provided six and seven years ago. The pattern over the most recent year indicates a drop in the sales info5 data and, consequently, perhaps lower revenue from potential customers. This association should be examined and visually evaluated more carefully in the future.
• When Personal info3 and Geographic info5 are plotted, the main cluster of data with a high degree of similarity is located around the value “CA,” which might be a geographical location given the name of the attribute, and residents of that state only chose the value “ZA” for Personal info3. This shows a strong correlation between the two attributes.
• To indicate which data values should be classified as outliers or noise, basic statistical descriptions can be employed to emphasize the features of the data. In order to do data preparation activities, we need to understand the characteristics of the data with relation to both central tendency and dispersion. The mean, median, mode, and midrange are all central tendency indicators. Quartiles, interquartile range (IQR), and variance are examples of data dispersion measures. Understanding how the data are distributed is greatly aided by these descriptive statistics.

• The arithmetic mean is the most common and most effective numeric measure of the “center” of a set of data. Weighted arithmetic mean: sometimes each value xi in a set is associated with a weight. The weights reflect the relevance or importance attached to each value.
• The mean is one of the measures of central tendency. It is the average of the given dataset, obtained by dividing the sum of the values by the number of items in the dataset. It is the most common measure of central tendency and can be computed for any numeric dataset.
• Another measure of central tendency is the median. It represents the middle value of a dataset. When the number of values is odd, it is the value in the middle once the dataset is arranged in ascending or descending order. When the number of values is even, it is the average of the two middle values. Unlike the mode, the median always exists for a non-empty dataset.
• The mode is also one of the measures of central tendency. It is the most frequent value in the data. It is therefore one of the most distinctive measures of central tendency: unlike the others, it may or may not exist. There is no formula for calculating the mode; it is found by observing which value occurs most often.
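The measures of central tendency described above can be computed with Python's standard `statistics` module. The sketch below is illustrative only: the sample data, the weights, and the `weighted_mean` helper are hypothetical and are not taken from the report's dataset.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum of w_i * x_i divided by the sum of w_i."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical sample data with an even number of values.
data = [1, 2, 2, 3, 4, 6]

print(mean(data))                     # 3   (18 / 6)
print(median(data))                   # 2.5 (average of the two middle values, 2 and 3)
print(multimode(data))                # [2] (the most frequent value)
print(multimode([1, 2, 3]))           # [1, 2, 3] (every value ties, so no single mode)
print(weighted_mean([2, 4], [1, 3]))  # 3.5 ((1*2 + 3*4) / (1 + 3))
```

`multimode` returns every value that shares the highest frequency, which makes the case where no single mode exists visible instead of hiding it.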
