CI7340
UK
Kingston University
Data plays an important role in solving business, social, psychological, scientific and IT problems. Organizing, pre-processing, visualizing and analysing raw data are important tasks in arriving at solutions to problem statements. Data usually comes in an unstructured, messy format; cleaning and pre-processing it is an essential part of reducing inconsistency and obtaining accurate results. The objective of this assignment is to apply the basic techniques of exploratory data analysis and data visualization; understanding the data and its statistics is the priority. Many data cleaning, pre-processing and visualization techniques exist in data science and statistics, and these methods help us visualize and understand the statistics of the data.
The aim of the research is preliminary data analysis: the step in which the researcher decides how the data will be analysed so that a solution to the research problem statement can be reached scientifically.
Many tools are available for data analysis, such as the R programming language, SPSS, STATA, KNIME and Tableau, but Python is the language most widely used for data science projects and statistical analysis; it is free and easy to use. Python has several development environments, such as PyCharm, Spyder, Jupyter Notebook and JupyterLab, each suited to different purposes; for data analysis and data manipulation, Jupyter is among the best choices in Python. Google Colab, an online clone of the Jupyter notebook, is also available. In this assignment, Jupyter Notebook has been used to solve the problem.
Unpingco explained that, in Python, the pandas library is a very efficient tool for data manipulation, while matplotlib and seaborn are visualization tools that help present the data effectively. NumPy is used for the mathematical computation behind algorithms and is useful for arrays, vectors, matrix operations and much more. Implementing a statistical algorithm with NumPy alone is very time-consuming, which is a disadvantage, because the algorithm must be built from scratch, and loading data is also not as easy as with pandas. So, for heavy mathematical computation, other libraries exist, such as statsmodels, SciPy and scikit-learn; NumPy works as the backend for all of them, so loading NumPy is necessary.
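The difference in loading effort can be sketched as follows; the tiny in-memory CSV here is an illustrative stand-in for a real data file, not the assignment data.

```python
import io
import numpy as np
import pandas as pd

# A small CSV standing in for a real data file (illustrative values only).
csv_text = "age,count_floors_pre_eq\n10,2\n25,3\n5,1\n"

# pandas: one call returns a labelled DataFrame with inferred dtypes.
df = pd.read_csv(io.StringIO(csv_text))

# NumPy: loads the same numbers, but only as an unlabelled array,
# and the header row has to be skipped by hand.
arr = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

print(df["age"].mean())   # column access by name
print(arr[:, 0].mean())   # column access by position only
```

Both calls load the same values; pandas simply keeps the column names, which matters once a dataset has dozens of variables.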
Chen explained that, with pandas, the groupby() function can be used to prepare data for a pie chart, and the merge() function can be used to merge two datasets. Through pandas, data manipulation becomes easy.
The assignment has two datasets, named 'inputfeatures.csv' and 'targetvalues.csv', containing 41 variables in total: inputfeatures.csv has 39 variables and targetvalues.csv has two. One variable, 'building_id', a unique random ID, is common to both. inputfeatures.csv holds the feature variables and targetvalues.csv holds the target variable. Excluding building_id and the target variable, there are 38 variables.
Variables | Description | Data type |
geo_level_1_id, geo_level_2_id, geo_level_3_id | Geographic region in which the building exists, from largest region to most specific sub-region. Level 1: 0-30, level 2: 0-1427, level 3: 0-12567. | Integer |
count_floors_pre_eq | Number of floors in the building before the earthquake. | Integer |
age | Age of the building in years. | Integer |
area_percentage | Normalized area of the building footprint. | Integer |
height_percentage | Normalized height of the building footprint. | Integer |
land_surface_condition | Surface condition of the land where the building was built. Categories: 'n', 'o', 't'. | Categorical |
foundation_type | Type of foundation used while building. Categories: 'h', 'i', 'r', 'u', 'w'. | Categorical |
roof_type | Type of roof used while building. Categories: 'n', 'q', 'x'. | Categorical |
ground_floor_type | Type of the ground floor. Categories: 'f', 'm', 'v', 'x', 'z'. | Categorical |
other_floor_type | Type of construction used in floors above the ground floor. Categories: 'j', 'q', 's', 'x'. | Categorical |
position | Position of the building. Categories: 'j', 'o', 's', 't'. | Categorical |
plan_configuration | Building plan configuration. Categories: 'a', 'c', 'd', 'f', 'm', 'n', 'o', 'q', 's', 'u'. | Categorical |
has_superstructure_adobe_mud | Whether the superstructure is made of adobe/mud. (Yes: 1, No: 0) | Binary |
has_superstructure_mud_mortar_stone | Whether the superstructure is made of mud mortar-stone. (Yes: 1, No: 0) | Binary |
has_superstructure_stone_flag | Whether the superstructure is made of stone. (Yes: 1, No: 0) | Binary |
has_superstructure_cement_mortar_stone | Whether the superstructure is made of cement mortar-stone. (Yes: 1, No: 0) | Binary |
has_superstructure_mud_mortar_brick | Whether the superstructure is made of mud mortar-brick. (Yes: 1, No: 0) | Binary |
has_superstructure_cement_mortar_brick | Whether the superstructure is made of cement mortar-brick. (Yes: 1, No: 0) | Binary |
has_superstructure_timber | Whether the superstructure is made of timber. (Yes: 1, No: 0) | Binary |
has_superstructure_bamboo | Whether the superstructure is made of bamboo. (Yes: 1, No: 0) | Binary |
has_superstructure_rc_non_engineered | Whether the superstructure is made of non-engineered reinforced concrete. (Yes: 1, No: 0) | Binary |
has_superstructure_rc_engineered | Whether the superstructure is made of engineered reinforced concrete. (Yes: 1, No: 0) | Binary |
has_superstructure_other | Whether the superstructure is made of any other material. (Yes: 1, No: 0) | Binary |
legal_ownership_status | Legal ownership status of the land on which the building was built. Categories: 'a', 'r', 'v', 'w'. | Categorical |
count_families | Total number of families living in the building. | Integer |
has_secondary_use | Whether the building was used for any secondary purpose. (Yes: 1, No: 0) | Binary |
has_secondary_use_agriculture | Whether the building was used for agriculture. (Yes: 1, No: 0) | Binary |
has_secondary_use_hotel | Whether the building was used as a hotel. (Yes: 1, No: 0) | Binary |
has_secondary_use_rental | Whether the building was used for rental purposes. (Yes: 1, No: 0) | Binary |
has_secondary_use_institution | Whether the building was used by an institution. (Yes: 1, No: 0) | Binary |
has_secondary_use_school | Whether the building was used as a school. (Yes: 1, No: 0) | Binary |
has_secondary_use_industry | Whether the building was used for industry. (Yes: 1, No: 0) | Binary |
has_secondary_use_health_post | Whether the building was used as a health post. (Yes: 1, No: 0) | Binary |
has_secondary_use_gov_office | Whether the building was used as a government office. (Yes: 1, No: 0) | Binary |
has_secondary_use_use_police | Whether the building was used by the police. (Yes: 1, No: 0) | Binary |
has_secondary_use_other | Whether the building was used for any other secondary purpose. (Yes: 1, No: 0) | Binary |
Table 1.1: Target Values data
Variable names | Description | Data type |
building_id | Unique random ID. | Integer |
damage_grade | Level of damage to the building caused by the earthquake. Categories: '1': low damage, '2': medium damage, '3': complete damage. | Categorical |
Shmueli et al. explained that initial data analysis consists of data cleaning and data pre-processing; in this step the data is made ready for analysis [7]. Neither dataset (inputfeatures.csv and targetvalues.csv) has any missing information. All the variables in the datasets have the correct data type, so no type conversion technique is needed. Because targetvalues.csv contains the dependent variable, which depends on the features in the input data, merging the two datasets is required. The variable common to both datasets is 'building_id', and the two datasets can easily be merged on it. After merging, the variables 'building_id', 'geo_level_1_id', 'geo_level_2_id' and 'geo_level_3_id' have no use in the analysis, so they are dropped using the drop() function [6].
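The merge-then-drop step can be sketched as below; the two toy DataFrames are illustrative stand-ins for inputfeatures.csv and targetvalues.csv, with made-up values.

```python
import pandas as pd

# Toy stand-ins for inputfeatures.csv and targetvalues.csv (illustrative values).
features = pd.DataFrame({
    "building_id": [101, 102, 103],
    "geo_level_1_id": [4, 7, 4],
    "age": [10, 25, 5],
})
targets = pd.DataFrame({
    "building_id": [101, 102, 103],
    "damage_grade": [3, 1, 2],
})

# Merge on the shared key, then drop the ID columns that carry no analytical value.
merged = features.merge(targets, on="building_id", how="inner")
merged = merged.drop(columns=["building_id", "geo_level_1_id"])

print(merged.columns.tolist())
```

In the real data the drop() call would also list 'geo_level_2_id' and 'geo_level_3_id'.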
Overall, the data quality is good enough. The data has some categorical features and some binary features. The binary features are already encoded as 1 and 0, but the categorical data needs to be converted to numerical data so that these variables can be used for modelling or so that any other statistical algorithm can easily be applied to them. The categorical variables in the data are 'legal_ownership_status', 'land_surface_condition', 'foundation_type', 'roof_type', 'ground_floor_type', 'other_floor_type', 'position' and 'plan_configuration'. Methods for converting or transforming categorical data into numerical data include the label encoder and the one-hot encoder.
These two methods can be implemented on the categorical variables using the Python library scikit-learn, also known as sklearn (Embarak and Karkal). The label encoder converts the categories to numerical codes, while the one-hot encoder does the same but creates dummy variables and so increases the number of variables; this can nevertheless be effective when the data has to be fed into a statistical or machine learning model [18], and it often gives better results than the label encoder. Converting the categorical variables to numerical ones is not required for the visualization techniques, so this step can be carried out after visualization.
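A minimal sketch of both encodings on one of the assignment's categorical variables; the four roof_type values here are illustrative, not taken from the real data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"roof_type": ["n", "q", "x", "n"]})

# Label encoding: each category becomes a single integer code.
le = LabelEncoder()
df["roof_type_label"] = le.fit_transform(df["roof_type"])

# One-hot encoding: each category becomes its own 0/1 dummy column,
# which grows the number of variables (here one column becomes three).
dummies = pd.get_dummies(df["roof_type"], prefix="roof_type")
print(dummies.columns.tolist())
```

Note that pd.get_dummies is the pandas route to one-hot encoding; scikit-learn's OneHotEncoder produces the equivalent dummy columns.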
EDA, or exploratory data analysis, is used for exploring the data. It helps summarize the main properties of the dataset. Using graphical and numerical statistics, one can explore the dataset thoroughly, make the required assumptions and take the actions needed to get the data ready for analysis. Exploratory data analysis differs from preliminary data analysis: in preliminary (initial) data analysis the data is pre-processed, while exploratory data analysis covers visualization, identifying patterns, grouping the data, and summarizing the statistics of the cleaned data. For some plots, such as a pie chart, the groupby() function can be combined with the matplotlib library to produce an effective chart easily.
Descriptive statistics describe the mean, median, mode, minimum value, maximum value and quartiles of the data [12]. This information helps us understand the overall statistics of the data. The mean, median and mode (the most frequent value) help us understand the average values and the most frequent numbers or categories in the data. The histogram, bar plot, count plot, correlation plot and pie chart are some of the visualization techniques that will help answer the research questions.
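These summary statistics are one call away in pandas; a minimal sketch on made-up building ages (describe() covers count, mean, quartiles, min and max, while the mode needs its own call).

```python
import pandas as pd

ages = pd.Series([5, 10, 10, 25, 40])  # illustrative building ages, not real data

stats = ages.describe()       # count, mean, std, min, 25%, 50%, 75%, max
print(stats["mean"], stats["min"], stats["max"])
print(ages.mode()[0])         # most frequent value, not part of describe()
```

On a merged DataFrame, df.describe() produces the same summary for every numerical column at once.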
Haden explains that the mean is the average value of continuous or discrete quantitative data. The mean of the data will identify the typical age of the buildings, and the minimum and maximum ages help identify the most recently built buildings as well as the oldest ones; identifying these answers research question 7. The mean, minimum and maximum values of the descriptive statistics also help in understanding the average, minimum and maximum number of families per building, which indicates how many lives were in danger during the earthquake (addressing research question 2).
The mode is the most frequent value of either qualitative or quantitative data. The mode will help identify the most frequent building age and, combined with the damage_grade statistics, whether buildings of that age were completely damaged or what level of damage they suffered during the earthquake (addressing research question 3). The mode also helps identify which roof type, foundation type, ground floor type or other floor type is most frequently used, answering half of research question 8. The secondary-use variables are binary in nature, so the mean or other statistics are not useful for them; research questions on the secondary purpose of buildings can be answered only through visualization. The remaining research questions can likewise be answered only through visualization techniques.
Wang et al. explained that statistics has many visualization techniques that help summarize the data visually and identify patterns in it. The histogram is a chart that reveals the distribution and type of the data; the mean, median and mode of the data can be visualized using a histogram.
Histogram: Kim et al. explained that the histogram identifies the distribution and nature of the data (features) and also illustrates its mean, median and mode. With the help of the histogram, research questions 3 and 4 can be answered, i.e. the most common level of damage caused to buildings during the earthquake, as well as the number of damaged buildings that can be used for a secondary purpose. The histogram acts as a visualization of the descriptive statistics and makes the nature of the data recognizable. With the matplotlib library one can plot histograms of all the numerical and binary variables.
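A minimal matplotlib sketch of such a histogram; the ages and the output file name are illustrative assumptions, and the non-interactive Agg backend is used so the figure is saved rather than displayed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: save the figure instead of showing it
import matplotlib.pyplot as plt
import numpy as np

ages = np.array([5, 10, 10, 15, 25, 25, 25, 40, 60, 80])  # illustrative values

fig, ax = plt.subplots()
counts, bins, patches = ax.hist(ages, bins=4, edgecolor="black")
ax.set_xlabel("age")
ax.set_ylabel("number of buildings")
fig.savefig("age_histogram.png")
```

In a notebook, the same call on df["age"] shows the distribution inline without savefig().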
Bar plot: A bar plot represents categorical data with rectangular bars whose lengths or heights are proportional to the values of the discrete or continuous quantitative variable associated with each category. It can be plotted horizontally or vertically [13]. A bar chart makes comparisons between different categories: the categorical data is plotted on the x-axis while the quantitative measure is represented on the y-axis. The bar chart helps in estimating the loss of human lives by combining the count of families with complete damage of buildings (addressing research questions 2 and 3). It also helps identify whether a damaged building was used for a secondary purpose and what level of damage the buildings suffered, answering half of research question 5.
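A sketch of such a bar chart; the per-grade family counts below are invented for illustration, not results from the assignment data.

```python
import matplotlib
matplotlib.use("Agg")  # save instead of display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative mean family count per damage grade (made-up numbers).
summary = pd.Series({"grade 1": 1.2, "grade 2": 1.5, "grade 3": 1.1})

fig, ax = plt.subplots()
bars = ax.bar(summary.index, summary.values)
ax.set_xlabel("damage grade")
ax.set_ylabel("mean count_families")
fig.savefig("families_bar.png")
```

On the real data the Series would come from merged.groupby("damage_grade")["count_families"].mean().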
Count plot: A count plot is a plot for categorical data; it counts the number of times each category is observed in the data. This plot addresses research questions 5 and 6: how many buildings were used for a secondary purpose, to identify the kind of loss caused by the earthquake, and which roof type, foundation type, ground floor type or other floor type caused the least damage to buildings.
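A count plot can be sketched with value_counts() and a bar chart (seaborn's countplot() draws the equivalent figure in one call); the roof_type values below are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # save instead of display
import matplotlib.pyplot as plt
import pandas as pd

roof = pd.Series(["n", "n", "q", "x", "n", "q"])  # illustrative roof types

counts = roof.value_counts()  # frequency of each category
fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)
ax.set_xlabel("roof_type")
ax.set_ylabel("count")
fig.savefig("roof_countplot.png")
```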
Correlation matrix plot: The correlation matrix plot helps identify the features most related to the target variable, and also the relations between different features. Chung et al. explained that the relation between a feature and the target variable can be negative or positive, which defines the direction of the relationship. This method also helps identify multicollinearity between the features. The plot answers research question 1: which features are most responsible for causing the damage of buildings (the target variable).
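A minimal sketch of a correlation matrix plot; the three-column DataFrame is an illustrative toy, not the assignment data (in it, floor count and damage grade are deliberately made perfectly correlated).

```python
import matplotlib
matplotlib.use("Agg")  # save instead of display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "age": [5, 10, 20, 40],
    "count_floors_pre_eq": [1, 2, 2, 3],
    "damage_grade": [1, 2, 2, 3],
})

corr = df.corr()  # pairwise Pearson correlations between all numeric columns

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.savefig("correlation_matrix.png")
```

seaborn's heatmap(corr, annot=True) produces the same plot with the coefficients written in each cell.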
Scatter plot: A scatter plot is a plot of two features against each other and, like the correlation matrix, explains the relationship between them [16]. The scatter plot further clarifies the direction of the relationship between the features and can also help answer research question 1.
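A scatter plot of one feature pair can be sketched as follows; both value lists are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # save instead of display
import matplotlib.pyplot as plt

age = [5, 10, 20, 40, 60]                 # illustrative values
height_percentage = [2, 3, 5, 6, 8]       # illustrative values

fig, ax = plt.subplots()
ax.scatter(age, height_percentage)
ax.set_xlabel("age")
ax.set_ylabel("height_percentage")
fig.savefig("age_vs_height_scatter.png")
```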
Pie chart: A pie chart visualizes the proportions of categorical data as percentages or counts; each share is calculated as the number of observations in a category divided by the total number of observations [16]. To address research question 8, one can use the groupby() function and plot the pie chart with the matplotlib library. The pie chart helps answer whether the ground floor type, other floor type, roof type, position, land surface condition, foundation type or legal ownership status is responsible for building damage, and to what percentage each is involved.
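The groupby-then-pie step can be sketched as below; the toy DataFrame and its foundation/damage values are invented for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # save instead of display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "foundation_type": ["r", "r", "r", "w", "i", "r"],  # illustrative
    "damage_grade":    [3,   3,   2,   3,   2,   3],
})

# Share of each foundation type among the most heavily damaged buildings.
damaged = df[df["damage_grade"] == 3]
shares = damaged.groupby("foundation_type").size()

fig, ax = plt.subplots()
ax.pie(shares, labels=shares.index, autopct="%1.1f%%")
ax.set_title("Foundation types among grade-3 damaged buildings")
fig.savefig("foundation_pie.png")
```

autopct writes each slice's percentage on the chart, which directly gives the "in what percentage" part of the question.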
As the discussion above shows, most of the research questions can be answered with descriptive statistics and visualization techniques alone, and these methods support decision-making without any advanced statistical or machine learning theory. Descriptive statistics and visualization also help in deciding which techniques should be used for further analysis. Data cleaning and pre-processing improve the quality of the data, so better and more accurate answers to the research questions can be obtained. A Jupyter notebook is a good and easy environment for detailed descriptive statistics and visualization.
[1] Wang Z, Sundin L, Murray-Rust D, Bach B. Cheat sheets for data visualization techniques. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems; 2020 Apr 21 (pp. 1-13).
[2] Toasa R, Maximiano M, Reis C, Guevara D. Data visualization techniques for real-time information: a custom and dynamic dashboard for analyzing surveys' results. In 2018 13th Iberian Conference on Information Systems and Technologies (CISTI); 2018 Jun 13 (pp. 1-7). IEEE.
[3] Stančin I, Jović A. An overview and comparison of free Python libraries for data mining and big data analysis. In 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO); 2019 May 20 (pp. 977-982). IEEE.
[4] Sahoo K, Samal AK, Pramanik J, Pani SK. Exploratory data analysis using Python. International Journal of Innovative Technology and Exploring Engineering (IJITEE). 2019 Oct;8(12):2019.
[5] Purohit K. Separation of Data Cleansing Concept from EDA. International Journal of Data Science and Analysis. 2021 Jun 22;7(3):89.
[6] Chen DY. Pandas for Everyone: Python Data Analysis. Addison-Wesley Professional; 2017 Dec 15.
[7] Kumar V, Khosla C. Data Cleaning: a thorough analysis and survey on unstructured data. In 2018 8th International Conference on Cloud Computing, Data Science & Engineering (Confluence); 2018 Jan 11 (pp. 305-309). IEEE.
[8] Embarak O. Data Analysis and Visualization Using Python. Apress; 2018.
[9] Kaur P, Stoltzfus J, Yellapu V. Descriptive statistics. International Journal of Academic Medicine. 2018 Jan 1;4(1):60.
[10] Vetter TR. Descriptive statistics: reporting the answers to the 5 basic questions of who, what, why, when, where, and a sixth, so what? Anesthesia & Analgesia. 2017 Nov 1;125(5):1797-802.
[11] Conner B, Johnson E. Descriptive statistics: use these tools to analyze data vital to practice-improvement projects. American Nurse Today. 2017 Nov 1;12(11):52-6.
[12] Haden P. Descriptive statistics. In The Cambridge Handbook of Computing Education Research, Sally A. Fincher and Anthony V. Robins (Eds.). Cambridge University Press, Cambridge. 2019 Feb 13:102-32.
[13] Kim J, Kim M, Yu H, Kim Y, Kim J. Effect of data visualization education with using Python on computational thinking of six grade in elementary school. Journal of The Korean Association of Information Education. 2019;23(3):197-206.
[14] Shmueli G, Bruce PC, Gedeck P, Patel NR. Data Mining for Business Analytics: Concepts, Techniques and Applications in Python. John Wiley & Sons; 2019 Oct 14.
[15] Sarka D. Descriptive statistics. In Advanced Analytics with Transact-SQL; 2021 (pp. 3-29). Apress, Berkeley, CA.
[16] Chung J, Pedigo BD, Bridgeford EW, Varjavand BK, Helm HS, Vogelstein JT. GraSPy: Graph Statistics in Python. Journal of Machine Learning Research. 2019;20:158.
[17] Vallat R. Pingouin: statistics in Python. Journal of Open Source Software. 2018 Nov 19;3(31):1026.
[18] Ivezić Ž, Connolly AJ, VanderPlas JT, Gray A. Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data. Princeton University Press; 2019 Dec 3.
[19] Oleinik K. Python for Data Analysis.
[20] Unpingco J. Pandas. In Python Programming for Data Analysis; 2021 (pp. 127-156). Springer, Cham.