CI7340 Applied Data Programming

  • Subject Code: CI7340
  • Country: UK
  • University: Kingston University

Answer:

Introduction

Data plays an important role in solving problems across business, social science, psychology, the natural sciences, IT, and other fields. Organizing, pre-processing, visualizing, and analysing raw data are essential tasks in reaching a solution to a problem statement. Data usually arrives in an unstructured, messy format; cleaning and pre-processing it reduces inconsistency and is an important part of obtaining accurate results. The objective of this assignment is to apply the basic techniques of exploratory data analysis and data visualization; understanding the data and its statistics is the priority. Data science and statistics offer a range of cleaning, pre-processing, and visualization techniques, and these methods help us visualize and understand the statistics of the data.

Research Questions:

The aims of the research are:

  1. To understand which features are most responsible for causing damage to buildings.
  2. To estimate the loss of human life by combining the count of families with the records of completely damaged buildings.
  3. To understand which level of damage was most commonly caused to buildings by the earthquake.
  4. To identify whether a large number of the damaged buildings were used for a secondary purpose.
  5. To identify which secondary purpose of building use is most common, and how many buildings were used for secondary purposes, in order to understand what kind of loss occurred after the earthquake.
  6. To understand which roof type, foundation type, ground floor type, or other floor type results in the least damage to buildings after an earthquake.
  7. To identify at which age buildings suffered complete damage, and whether greater building age leads to complete damage after an earthquake.
  8. To understand whether ground floor type, other floor type, roof type, position, land surface condition, foundation type, or legal ownership status is responsible for building damage, and to what percentage each is involved.

Preliminary Data Analysis

Preliminary data analysis is the step in which the researcher decides how the data will be analysed in order to address the research problem statement scientifically.

Programming Language and Tools

Many tools are available for data analysis, such as the R programming language, SPSS, STATA, KNIME, and Tableau, but Python is the language most often used for data science projects and statistical analysis: it is easy to use and free. Python has several IDEs and environments, such as PyCharm, Spyder, Jupyter Notebook, and JupyterLab; these serve different purposes, and for data analysis and data manipulation Jupyter is the best choice in Python. Google Colab, an online clone of Jupyter Notebook, is also available. In this assignment, Jupyter Notebook has been used to solve the problem.

Advantages and Disadvantages of Tools

  • Google Colab uses little local memory because it runs in its own virtual space, and it is a free, open tool; however, it consumes more internet bandwidth when loading data and running a full analysis.
  • SPSS is easy to use but is not free and is relatively expensive; machine learning algorithms can be implemented in it easily, and compared with other programming languages it is very easy to use.
  • Tableau is a visualization tool that is easy to use, produces very effective graphs, and can tell a story through dashboards, but machine learning and statistical methods cannot be implemented in Tableau.
  • R is a free, open-source, continuously growing tool that is highly compatible, and machine learning and deep learning algorithms can easily be implemented in it. However, handling data is not easy in R: it consumes more memory, cannot load and manage very large amounts of data, and is much slower than other languages.
  • Jupyter Notebook is platform and language independent: notebooks are stored in JSON format, can be processed by any programming language, and can be converted to file formats such as HTML, PDF, Markdown, and .py. However, notebooks are less secure, and no full IDE integration exists for Jupyter Notebook [19].

Unpingco explains that, in Python, the pandas library is a very efficient tool for data manipulation, while matplotlib and seaborn are visualization tools that help in visualizing the data effectively. NumPy is used for the mathematical computation behind algorithms: it is useful for arrays, vectors, matrix functions, and more. Implementing a statistical algorithm using only NumPy is very time-consuming, which is a disadvantage, because the algorithm must be built from scratch, and loading data is not as easy as with pandas. For heavy mathematical computation, other libraries such as statsmodels, SciPy, and scikit-learn are available; NumPy works as the backend for all of them, so loading NumPy is necessary.

Chen explains that the groupby function is used when creating a pie chart, and that the merge function in pandas is used for merging two datasets; pandas makes data manipulation easy.

Dataset

The assignment has two datasets, named 'inputfeatures.csv' and 'targetvalues.csv'. Together they contain 41 variables: inputfeatures.csv has 39 variables and targetvalues.csv has two. One variable, 'building_id', is common to both datasets and is a unique, random ID. inputfeatures.csv holds the feature variables and targetvalues.csv holds the target variable. Excluding building_id and the target variable, there are 38 variables.

Table 1.0: Input features data

| Variable | Description | Data type |
| --- | --- | --- |
| geo_level_1_id, geo_level_2_id, geo_level_3_id | Geographic region in which the building exists, from largest region to most specific sub-region. Level 1: 0-30, level 2: 0-1427, level 3: 0-12567. | Integer |
| count_floors_pre_eq | Number of floors in the building before the earthquake. | Integer |
| age | Age of the building in years. | Integer |
| area_percentage | Normalized area of the building footprint. | Integer |
| height_percentage | Normalized height of the building footprint. | Integer |
| land_surface_condition | Surface condition of the land on which the building was built. Categories: 'n', 'o', 't'. | Categorical |
| foundation_type | Type of foundation used while building. Categories: 'h', 'i', 'r', 'u', 'w'. | Categorical |
| roof_type | Type of roof used while building. Categories: 'n', 'q', 'x'. | Categorical |
| ground_floor_type | Type of the ground floor. Categories: 'f', 'm', 'v', 'x', 'z'. | Categorical |
| other_floor_type | Type of construction used in floors above the ground floor. Categories: 'j', 'q', 's', 'x'. | Categorical |
| position | Position of the building. Categories: 'j', 'o', 's', 't'. | Categorical |
| plan_configuration | Building plan configuration. Categories: 'a', 'c', 'd', 'f', 'm', 'n', 'o', 'q', 's', 'u'. | Categorical |
| has_superstructure_adobe_mud | Whether the superstructure was made of adobe/mud (Yes: 1, No: 0). | Binary |
| has_superstructure_mud_mortar_stone | Whether the superstructure was made of mud mortar-stone (Yes: 1, No: 0). | Binary |
| has_superstructure_stone_flag | Whether the superstructure was made of stone (Yes: 1, No: 0). | Binary |
| has_superstructure_cement_mortar_stone | Whether the superstructure was made of cement mortar-stone (Yes: 1, No: 0). | Binary |
| has_superstructure_mud_mortar_brick | Whether the superstructure was made of mud mortar-brick (Yes: 1, No: 0). | Binary |
| has_superstructure_cement_mortar_brick | Whether the superstructure was made of cement mortar-brick (Yes: 1, No: 0). | Binary |
| has_superstructure_timber | Whether the superstructure was made of timber (Yes: 1, No: 0). | Binary |
| has_superstructure_bamboo | Whether the superstructure was made of bamboo (Yes: 1, No: 0). | Binary |
| has_superstructure_rc_non_engineered | Whether the superstructure was made of non-engineered reinforced concrete (Yes: 1, No: 0). | Binary |
| has_superstructure_rc_engineered | Whether the superstructure was made of engineered reinforced concrete (Yes: 1, No: 0). | Binary |
| has_superstructure_other | Whether the superstructure was made of other material (Yes: 1, No: 0). | Binary |
| legal_ownership_status | Legal ownership status of the land on which the building was built. Categories: 'a', 'r', 'v', 'w'. | Categorical |
| count_families | Total number of families living in the building. | Integer |
| has_secondary_use | Whether the building was used for any secondary purpose (Yes: 1, No: 0). | Binary |
| has_secondary_use_agriculture | Whether the building was used for agriculture (Yes: 1, No: 0). | Binary |
| has_secondary_use_hotel | Whether the building was used as a hotel (Yes: 1, No: 0). | Binary |
| has_secondary_use_rental | Whether the building was used for rental purposes (Yes: 1, No: 0). | Binary |
| has_secondary_use_institution | Whether the building was used by an institution (Yes: 1, No: 0). | Binary |
| has_secondary_use_school | Whether the building was used as a school (Yes: 1, No: 0). | Binary |
| has_secondary_use_industry | Whether the building was used for industry (Yes: 1, No: 0). | Binary |
| has_secondary_use_health_post | Whether the building was used as a health post (Yes: 1, No: 0). | Binary |
| has_secondary_use_gov_office | Whether the building was used as a government office (Yes: 1, No: 0). | Binary |
| has_secondary_use_use_police | Whether the building was used by the police (Yes: 1, No: 0). | Binary |
| has_secondary_use_other | Whether the building was used for other purposes (Yes: 1, No: 0). | Binary |

Table 1.1: Target Values data

| Variable | Description | Data type |
| --- | --- | --- |
| building_id | Unique and random ID. | Integer |
| damage_grade | Level of damage the earthquake caused to the building. Categories: 1: low damage, 2: medium damage, 3: complete damage. | Categorical |

Initial Data Analysis

Shmueli et al. explain that initial data analysis consists of data cleaning and data pre-processing; in this step we make the data ready for analysis [7]. Neither dataset (inputfeatures.csv and targetvalues.csv) has any missing information. All variables in the data already have the correct data type, so no type conversion is needed. Because targetvalues.csv contains the dependent variable, which depends on the features in the input data, the two datasets must be merged. The variable common to both datasets is 'building_id', so they can easily be merged on it. After merging, the variables 'building_id', 'geo_level_1_id', 'geo_level_2_id', and 'geo_level_3_id' have no use in the analysis, so they are dropped using the drop() function [6].
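
A minimal sketch of this merge-and-drop step, using tiny synthetic frames in place of the real CSV files (in the assignment itself they would be loaded with pd.read_csv):

```python
import pandas as pd

# Synthetic stand-ins for inputfeatures.csv and targetvalues.csv.
features = pd.DataFrame({
    "building_id": [101, 102, 103],
    "geo_level_1_id": [6, 8, 21],
    "geo_level_2_id": [487, 900, 38],
    "geo_level_3_id": [12198, 2812, 8973],
    "age": [30, 10, 55],
    "count_families": [1, 2, 1],
})
targets = pd.DataFrame({
    "building_id": [101, 102, 103],
    "damage_grade": [3, 2, 3],
})

# Merge the two datasets on their common key, then drop the ID columns,
# which carry no information for the analysis.
df = features.merge(targets, on="building_id", how="inner")
df = df.drop(columns=["building_id", "geo_level_1_id",
                      "geo_level_2_id", "geo_level_3_id"])
print(df.columns.tolist())  # ['age', 'count_families', 'damage_grade']
```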

Overall, the data quality is good enough. The data has some categorical features and some binary features. The binary features are already encoded as 1 and 0, but the categorical features need to be converted to numerical form so that they can be used for modelling or so that other statistical algorithms can easily be applied to them. The categorical variables in the data are 'legal_ownership_status', 'land_surface_condition', 'foundation_type', 'roof_type', 'ground_floor_type', 'other_floor_type', 'position', and 'plan_configuration'. Two methods exist for transforming categorical data to numerical data: the label encoder and the one-hot encoder.

These two methods can be applied to the categorical variables using the Python library scikit-learn, also known as sklearn (Embarak and Karkal). The label encoder converts categories to numerical codes, while the one-hot encoder does the same but creates dummy variables, which increases the number of variables; this is sometimes effective when the data must be fed into a statistical or machine learning model [18]. One-hot encoding often gives better results than label encoding. This conversion is not required for the visualization techniques, so it can be carried out after visualization, as a next step.
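
As a sketch of the two encodings on a hypothetical sample of one categorical column (pd.get_dummies is used here as the pandas equivalent of sklearn's OneHotEncoder):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of one categorical column from the dataset.
df = pd.DataFrame({"roof_type": ["n", "q", "x", "n", "q"]})

# Label encoding: each category becomes one integer code.
le = LabelEncoder()
df["roof_type_label"] = le.fit_transform(df["roof_type"])

# One-hot encoding: one dummy (0/1) column per category.
dummies = pd.get_dummies(df["roof_type"], prefix="roof_type")

print(df["roof_type_label"].tolist())  # [0, 1, 2, 0, 1]
print(dummies.columns.tolist())        # ['roof_type_n', 'roof_type_q', 'roof_type_x']
```

Note how one-hot encoding grows the column count with the number of categories, which is the trade-off mentioned above.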

Introduction to EDA

EDA, exploratory data analysis, is used for exploring the data; it helps in summarizing the main properties of a dataset. Using graphical or numerical statistics, one can explore the dataset thoroughly, form the required assumptions, and take the actions needed to make the data ready for analysis. Exploratory data analysis differs from preliminary data analysis: in preliminary (initial) data analysis the data is pre-processed, while exploratory data analysis covers visualization, identifying patterns, grouping the data, and summaries that explain the statistics of the cleaned data. For a chart such as a pie chart, one can use the groupby() function together with the matplotlib library to plot an effective chart easily.

Descriptive Statistics

Descriptive statistics cover the mean, median, mode, minimum and maximum values, and quartiles of the data [12]. This information helps us understand the overall statistics of the data: the mean, median, and mode (most frequent value) reveal the average and the most common numbers or categories in the data. Histograms, bar plots, count plots, correlation plots, and pie charts are some of the visualization techniques that will help answer the research questions.

Haden explains that the mean is the average value of continuous or discrete quantitative data. The mean of the data will identify the typical age of a building, while the minimum and maximum ages identify the most recently built and the oldest buildings; together these answer research question 7. The mean, minimum, and maximum of the descriptive statistics also describe the average, minimum, and maximum number of families per building, which indicates how many lives were in danger during the earthquake (research question 2).

The mode is the most frequent value of both qualitative and quantitative data. The mode will help identify the most frequent building age, and combined with the damage_grade statistics we can determine whether buildings of that age were completely damaged, and what level of damage was caused during the earthquake (research question 3). The mode also helps identify which roof type, foundation type, ground floor type, or other floor type is used most frequently, which answers half of research question 8. The secondary-use variables are binary in nature, so the mean and other statistics are not informative for them; the research questions on secondary use can only be answered through visualization. The remaining research questions can likewise be answered only through visualization techniques.
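
pandas exposes these descriptive statistics directly; a minimal sketch on a hypothetical sample of the quantitative columns discussed above:

```python
import pandas as pd

# Hypothetical sample of two quantitative columns from the merged data.
df = pd.DataFrame({
    "age": [5, 10, 10, 30, 100],
    "count_families": [1, 1, 2, 1, 3],
})

# describe() reports count, mean, std, min, quartiles, and max in one call.
stats = df.describe()
print(stats.loc[["mean", "min", "max"], "age"])

# mode() returns the most frequent value(s) of a column.
age_mode = df["age"].mode()[0]
print(age_mode)  # 10
```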

Data Visualization

Wang et al. explain that statistics offers many visualization techniques that help summarize the data visually and identify patterns in it. The histogram is a chart that shows the distribution and type of the data; the mean, median, and mode of the data can be visualized using a histogram.

Histogram: Kim et al. explain that a histogram identifies the distribution and nature of the data (features) and also reflects its mean, median, and mode. With the help of a histogram, research questions 3 and 4 can be answered, i.e. the level of damage most often caused to buildings during the earthquake, and how many damaged buildings were used for a secondary purpose. The histogram acts as a visualization of the descriptive statistics, making the nature of the data recognizable. With the matplotlib library one can plot histograms of all the numerical and binary variables.
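
A minimal histogram sketch with matplotlib, using a hypothetical sample of building ages in place of the real column:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample of building ages from the merged data.
age = np.array([5, 10, 10, 15, 20, 20, 20, 30, 45, 100])

fig, ax = plt.subplots()
counts, bins, _ = ax.hist(age, bins=5, edgecolor="black")
ax.set_xlabel("Building age (years)")
ax.set_ylabel("Number of buildings")
ax.set_title("Distribution of building age")
fig.savefig("age_histogram.png")
```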

Bar plot: A bar plot represents categorical data with rectangular bars whose lengths or heights are proportional to the values of a discrete or continuous quantitative variable for each category; it can be drawn horizontally or vertically [13]. A bar chart makes comparisons between categories: the categorical data is plotted on the x-axis, while the quantitative measure is represented on the y-axis. A bar chart helps identify the loss of human lives using the count of families and the completely damaged buildings (research questions 2 and 3). It also helps identify whether a damaged building was used for a secondary purpose and what level of damage was caused, answering half of research question 5.
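
A sketch of the bar chart for research question 2, aggregating a hypothetical sample of family counts per damage grade:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample: families per building and its damage grade.
df = pd.DataFrame({
    "damage_grade": [1, 1, 2, 2, 3, 3, 3],
    "count_families": [1, 2, 1, 1, 2, 3, 1],
})

# Total families affected per damage grade (category on x, measure on y).
families_per_grade = df.groupby("damage_grade")["count_families"].sum()

fig, ax = plt.subplots()
families_per_grade.plot(kind="bar", ax=ax)
ax.set_xlabel("Damage grade")
ax.set_ylabel("Total families affected")
fig.savefig("families_per_grade.png")
print(families_per_grade.to_dict())  # {1: 3, 2: 2, 3: 6}
```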

Count plot: A count plot is a plot for categorical data; it counts how many times each category is observed in the available data or feature. This plot addresses research questions 5 and 6: how many buildings were used for a secondary purpose, in order to identify what kind of loss occurred after the earthquake, and which roof type, foundation type, ground floor type, or other floor type causes the least damage to buildings after an earthquake.
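
A count-plot sketch with seaborn on a hypothetical sample of the roof_type feature; the plot simply visualizes the per-category counts that value_counts() reports:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample of the roof_type feature.
df = pd.DataFrame({"roof_type": ["n", "n", "q", "x", "n", "q"]})

fig, ax = plt.subplots()
sns.countplot(data=df, x="roof_type", ax=ax)
ax.set_title("Observations per roof type")
fig.savefig("roof_type_counts.png")

# Same counts, numerically.
counts = df["roof_type"].value_counts()
print(counts.to_dict())  # {'n': 3, 'q': 2, 'x': 1}
```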

Correlation matrix plot: A correlation matrix plot helps identify which features are important for the target variable and shows the relations between different features. Chung et al. explain that the relation between a feature and the target variable can be negative or positive, which defines the impact or direction of the relation. This method also helps identify multicollinearity between features in the data. The plot helps answer research question 1: which features are most responsible for causing damage to buildings (the target variable).
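
A sketch of the correlation matrix on a hypothetical numeric sample (the real input would be the merged, encoded data); pandas computes the matrix and matplotlib displays it:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric sample including the target variable.
df = pd.DataFrame({
    "age": [5, 10, 30, 60, 100],
    "count_floors_pre_eq": [1, 1, 2, 2, 3],
    "damage_grade": [1, 1, 2, 3, 3],
})

corr = df.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots()
im = ax.matshow(corr, vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.savefig("correlation_matrix.png")
```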

Scatter plot: A scatter plot is drawn between two features and explains the relationship between them, much as a correlation matrix does [16]. A scatter plot further clarifies the direction of the relationship between features, and it can also help answer research question 1.
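
A minimal scatter-plot sketch between two hypothetical features:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt

# Hypothetical samples of two features discussed in the text.
age = [5, 10, 30, 60, 100]
height_percentage = [2, 3, 5, 6, 8]

fig, ax = plt.subplots()
ax.scatter(age, height_percentage)
ax.set_xlabel("age")
ax.set_ylabel("height_percentage")
ax.set_title("Relationship between two features")
fig.savefig("scatter.png")
```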

Pie chart: A pie chart helps visualize the proportions of categorical data as percentages or counts; each slice is the number of observations in a category divided by the total number of observations [16]. To address research question 8 we can use the groupby function and plot pie charts with the matplotlib library. The pie chart will help answer research question 8: whether ground floor type, other floor type, roof type, position, land surface condition, foundation type, or legal ownership status is responsible for building damage, and to what percentage each is involved.
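
A sketch of this groupby-then-pie approach, using a hypothetical sample of foundation types among completely damaged buildings:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample: foundation type of completely damaged buildings.
df = pd.DataFrame({
    "foundation_type": ["r", "r", "r", "u", "w", "r", "u"],
    "damage_grade": [3, 3, 3, 3, 3, 3, 3],
})

# groupby counts the completely damaged buildings per foundation type.
shares = df.groupby("foundation_type")["damage_grade"].count()

fig, ax = plt.subplots()
ax.pie(shares, labels=shares.index, autopct="%1.1f%%")
ax.set_title("Foundation types among completely damaged buildings")
fig.savefig("foundation_pie.png")
print(shares.to_dict())  # {'r': 4, 'u': 2, 'w': 1}
```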

Conclusion

As discussed above, most of the research questions can be answered with descriptive statistics and visualization techniques alone; these methodologies support decision-making without any advanced statistical or machine learning theory. Descriptive statistics and visualization also help in deciding which techniques to use for further analysis. Data cleaning and pre-processing improve the quality of the data, so we can obtain better and nearly accurate answers to the research questions. Jupyter Notebook is a good and easy environment for detailed descriptive statistics and visualization techniques.

References

[1] Wang Z, Sundin L, Murray-Rust D, Bach B. Cheat sheets for data visualization techniques. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems 2020 Apr 21 (pp. 1-13).

[2] Toasa R, Maximiano M, Reis C, Guevara D. Data visualization techniques for real-time information—A custom and dynamic dashboard for analyzing surveys' results. In 2018 13th Iberian Conference on Information Systems and Technologies (CISTI) 2018 Jun 13 (pp. 1-7). IEEE.

[3] Stančin I, Jović A. An overview and comparison of free Python libraries for data mining and big data analysis. In 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) 2019 May 20 (pp. 977-982). IEEE.

[4] Sahoo K, Samal AK, Pramanik J, Pani SK. Exploratory data analysis using Python. International Journal of Innovative Technology and Exploring Engineering (IJITEE). 2019 Oct;8(12):2019.

[5] Purohit K. Separation of Data Cleansing Concept from EDA. International Journal of Data Science and Analysis. 2021 Jun 22;7(3):89.

[6] Chen DY. Pandas for everyone: Python data analysis. Addison-Wesley Professional; 2017 Dec 15.

[7] Kumar V, Khosla C. Data Cleaning - A thorough analysis and survey on unstructured data. In 2018 8th International Conference on Cloud Computing, Data Science & Engineering (Confluence) 2018 Jan 11 (pp. 305-309). IEEE.

[8] Embarak DO, Embarak, Karkal. Data analysis and visualization using Python. Apress; 2018.

[9] Kaur P, Stoltzfus J, Yellapu V. Descriptive statistics. International Journal of Academic Medicine. 2018 Jan 1;4(1):60.

[10] Vetter TR. Descriptive statistics: Reporting the answers to the 5 basic questions of who, what, why, when, where, and a sixth, so what? Anesthesia & Analgesia. 2017 Nov 1;125(5):1797-802.

[11] Conner B, Johnson E. Descriptive statistics: Use these tools to analyze data vital to practice-improvement projects. American Nurse Today. 2017 Nov 1;12(11):52-6.

[12] Haden P. 5 Descriptive Statistics. In The Cambridge Handbook of Computing Education Research, Sally A. Fincher and Anthony V. Robins (Eds.). Cambridge University Press, Cambridge. 2019 Feb 13:102-32.

[13] Kim J, Kim M, Yu H, Kim Y, Kim J. Effect of data visualization education with using Python on computational thinking of six grade in elementary school. Journal of The Korean Association of Information Education. 2019;23(3):197-206.

[14] Shmueli G, Bruce PC, Gedeck P, Patel NR. Data mining for business analytics: concepts, techniques and applications in Python. John Wiley & Sons; 2019 Oct 14.

[15] Sarka D. Descriptive statistics. In Advanced Analytics with Transact-SQL 2021 (pp. 3-29). Apress, Berkeley, CA.

[16] Chung J, Pedigo BD, Bridgeford EW, Varjavand BK, Helm HS, Vogelstein JT. GraSPy: Graph Statistics in Python. J. Mach. Learn. Res. 2019 Jan 1;20:158.

[17] Vallat R. Pingouin: statistics in Python. Journal of Open Source Software. 2018 Nov 19;3(31):1026.

[18] Ivezić Ž, Connolly AJ, VanderPlas JT, Gray A. Statistics, data mining, and machine learning in astronomy: A practical Python guide for the analysis of survey data. Princeton University Press; 2019 Dec 3.

[19] Oleinik K. Python for Data Analysis.

[20] Unpingco J. Pandas. In Python Programming for Data Analysis 2021 (pp. 127-156). Springer, Cham.
