Improving the correlation hunting in a large quantity of SOM component planes
Classification of agro-ecological variables related with productivity in the sugar cane culture. Miguel BARRETO, Andrés Pérez-Uribe. MINISTERIO DE AGRICULTURA Y DESARROLLO RURAL, asocaña
Self Organizing Maps A self-organizing map (SOM) can be seen as a data visualization technique that reduces the dimensionality of data through a self-organizing clustering algorithm. The problem that data visualization attempts to solve is that humans cannot visualize high-dimensional data. Such techniques improve the understanding of high-dimensional data by presenting it in a low-dimensional space. A SOM does this by placing points that are close in the high-dimensional space close to each other in the low-dimensional space. From a computational point of view, the self-organizing model is both a projection method, which maps a high-dimensional data space onto a low-dimensional one (dimensionality reduction), and a clustering method, in which similar data samples tend to be mapped to nearby neurons.
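The projection and clustering behavior described here can be illustrated with a minimal NumPy sketch of the classic online SOM update rule. The function name `train_som`, the grid size, the linear decay schedules and the random input data are all assumptions made for illustration; the slides do not state which implementation or parameters were used.

```python
import numpy as np

def train_som(data, rows=6, cols=6, n_iter=2000, lr0=0.5, seed=0):
    """Train a small rectangular SOM with the online update rule (illustrative)."""
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    sigma0 = max(rows, cols) / 2.0
    # Grid coordinates of every neuron, shape (rows * cols, 2).
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise the codebook with randomly chosen data samples.
    weights = data[rng.integers(0, len(data), rows * cols)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)    # neighbourhood radius shrinks
        x = data[rng.integers(0, len(data))]       # pick one random sample
        # Best-matching unit: the neuron whose weight vector is closest to x.
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
        # Gaussian neighbourhood measured on the 2-D grid, not in data space.
        d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))
        # Pull the BMU and its grid neighbours towards the sample.
        weights += lr * h[:, None] * (x - weights)
    return weights.reshape(rows, cols, n_features)

# Synthetic 3-dimensional data, used only to exercise the function.
data = np.random.default_rng(1).random((200, 3))
som = train_som(data, rows=5, cols=4, n_iter=500)
```

Because neighbouring neurons are updated together, nearby grid positions end up with similar weight vectors, which is what lets the 2-D grid stand in for the high-dimensional data.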
Component planes To improve the analysis of the relationships between variables and/or their influence on the outputs of the system, the self-organizing map can be sliced to visualize its so-called component planes, one per input variable. [Figure: the SOM weight vectors (Vector 1, Vector 2, ..., Vector n) sliced into component planes such as Ra1AS, P1AS, TMAS and V1.]
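Slicing a SOM into component planes amounts to taking, for each input variable, that variable's value in every neuron's weight vector. The sketch below assumes the codebook is stored as a `(rows, cols, n_features)` array; the helper name `component_planes` is hypothetical, the variable labels are borrowed from the slide, and the random codebook is only a placeholder for a trained map.

```python
import numpy as np

def component_planes(codebook, names):
    """Split a (rows, cols, n_features) codebook into one 2-D plane per variable."""
    if codebook.shape[2] != len(names):
        raise ValueError("need exactly one name per codebook feature")
    return {name: codebook[:, :, i] for i, name in enumerate(names)}

# Placeholder codebook; a trained SOM would be used in practice.
rng = np.random.default_rng(0)
codebook = rng.random((4, 5, 3))
planes = component_planes(codebook, ["Ra1AS", "P1AS", "TMAS"])
```

Each plane has the same shape as the map grid, so it can be drawn as a heat map and compared visually with the other planes.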
Example: Clustering of the SOM easily reveals distinct gene expression patterns. Results of a reanalysis of a lymphoma study (Junbai Wang et al., 2002): a) 42 DLBCL samples; on the SOM color scale, red indicates high expression and blue indicates low expression. b) The cluster numbers represent the group of genes contained.
Correlation hunting The task of organizing similar component planes in order to find correlating components is called correlation hunting.
Correlation hunting Here, the term correlation covers not only linear correlations but also nonlinear and local or partial correlations between variables.
Correlation hunting However, when the number of components is large, it is difficult to determine which planes are similar to each other.
Correlation hunting A new SOM can be used to reorganize the component planes in order to perform the correlation hunting. The main idea is to place correlated components close to each other.
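One way to prepare for this step is to treat every component plane as a single sample: flatten it into a vector and standardize it, so that planes with the same shape but different units become comparable. The sketch below does only this preparation and then finds each plane's nearest neighbour by Euclidean distance, as a stand-in for the second SOM. The function names, the standardization choice and the synthetic codebook are assumptions, not the authors' procedure.

```python
import numpy as np

def planes_as_samples(codebook):
    """Turn each component plane into one standardized sample vector."""
    rows, cols, n_features = codebook.shape
    samples = codebook.reshape(rows * cols, n_features).T.astype(float)
    samples = samples - samples.mean(axis=1, keepdims=True)
    std = samples.std(axis=1, keepdims=True)
    std[std == 0] = 1.0                      # constant planes stay at zero
    return samples / std

def most_similar_plane(samples):
    """For every plane, return the index of its closest other plane."""
    diff = samples[:, None, :] - samples[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)           # ignore self-matches
    return dist.argmin(axis=1)

# Synthetic codebook in which plane 2 is a rescaled copy of plane 0.
rng = np.random.default_rng(1)
codebook = rng.random((5, 5, 4))
codebook[:, :, 2] = 3.0 * codebook[:, :, 0] + 1.0
samples = planes_as_samples(codebook)
nearest = most_similar_plane(samples)
```

The rows of `samples` are what a second SOM would receive as training data. Note that this distance treats inversely related planes as dissimilar, so it captures only part of the broader notion of correlation used in these slides.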
Correlation hunting An advantage of using a SOM for component plane projection is that the placements of the component planes can be shown on a regular grid. In addition, an ordered presentation of similar components is generated automatically. A disadvantage is that the grouping of the variables is left to the user.
More component planes … Heart disease 279 component planes This database contains 13 attributes (which have been extracted from a larger set of 75)
Clustering of SOM component planes based on the SOM distance matrix The U-matrix has been used as an effective cluster distance function. It visualizes the distances between each map unit and its neighbors, which makes the SOM cluster structure visible.
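A simple form of the U-matrix can be sketched as the average distance between each unit's weight vector and those of its direct grid neighbours. The version below assumes a rectangular grid with a 4-neighbourhood and a `(rows, cols, n_features)` codebook; other U-matrix variants (hexagonal grids, interleaved cells) exist and the slides do not say which one was used.

```python
import numpy as np

def u_matrix(codebook):
    """Mean distance from each unit to its up/down/left/right neighbours."""
    rows, cols, _ = codebook.shape
    umat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(codebook[r, c] - codebook[rr, cc]))
            umat[r, c] = np.mean(dists) if dists else 0.0
    return umat

# Toy codebook with two flat regions: columns 0-1 hold [0, 0], columns 2-3 hold [1, 1].
codebook = np.zeros((4, 4, 2))
codebook[:, 2:, :] = 1.0
umat = u_matrix(codebook)
```

In this toy map the U-matrix is zero inside each flat region and rises along the boundary between columns 1 and 2, which is the ridge that separates the two clusters.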
Use Vellido’s algorithm to partition the map Vellido’s algorithm is used to obtain different partitioning levels of the SOM clustering. It partitions the map into a set of base clusters. The number of base clusters equals the number of local minima of the U-matrix, and these base clusters allow different levels of clustering.
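The slides do not list the steps of the algorithm, so the following sketch covers only the one property stated here: counting the local minima of a U-matrix, each of which would seed one base cluster. The strict 8-neighbourhood definition of a minimum, the function name `local_minima` and the toy U-matrix values are assumptions; this is not a full implementation of the partitioning.

```python
import numpy as np

def local_minima(umat):
    """Return grid positions whose value is strictly below all 8-neighbours."""
    rows, cols = umat.shape
    minima = []
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                umat[rr, cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            ]
            if all(umat[r, c] < v for v in neighbours):
                minima.append((r, c))
    return minima

# Toy U-matrix: two low basins separated by a ridge in the third column.
umat = np.array([
    [0.1, 0.5, 0.9, 0.6],
    [0.4, 0.7, 0.9, 0.5],
    [0.6, 0.8, 0.9, 0.2],
])
seeds = local_minima(umat)
```

Here `seeds` contains two positions, one per basin, so this toy map would start from two base clusters.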
Case study: sugar cane culture <ul><li>The agricultural productivity of a geographic area depends on many agro-ecological variables, such as soil and terrain characteristics, climatic constraints, human behavior and management.</li></ul>Soil Management Climate Genotype Productivity
A new approach 1358 experiments Management Climate Genotype Each agroecological event is unique in time and space, but events share similar characteristics. Finding these similarities makes it possible to identify similar behaviors and to discover why and how the agroecological variables affect crop development and, therefore, agricultural productivity. Sowing Growing Harvest Soil
The variables <ul><li>Climate variables. Continuous data. </li></ul><ul><li>Average Temperature (TempAvg) / After Seed (AS) / Before Harvest (BH) </li></ul><ul><li>Average Relative Humidity (RHAvg) / After Seed (AS) / Before Harvest (BH) </li></ul><ul><li>Radiation (Rad) / After Seed (AS) / Before Harvest (BH) </li></ul><ul><li>Precipitation (Prec) / After Seed (AS) / Before Harvest (BH) </li></ul><ul><li>Soil variables. </li></ul><ul><li>Order (Ord) / 3 orders (Ord1, Ord2, Ord3) / Nominal data </li></ul><ul><li>Texture (Tex) / Ordinal data </li></ul><ul><li>Depth (Dee) / Ordinal data </li></ul><ul><li>Topographic variables. </li></ul><ul><li>Landscape (Ls) / 3 landscapes (Ls1, Ls2, Ls3) / Nominal data </li></ul><ul><li>Slope (Sl) / Ordinal data </li></ul><ul><li>Other variables. </li></ul><ul><li>Water Balance (WB) / Ordinal data </li></ul><ul><li>Variety (Var) / 3 varieties (V1, V2, V3) / Nominal data </li></ul><ul><li>Production </li></ul><ul><li>Total 54 </li></ul>[Figure: timeline of months 1 to 4 after seed (AS) and months 1 to 4 before harvest (BH).]
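The list mixes continuous, ordinal and nominal data, and the nominal variables appear as several separate components (for example V1, V2, V3). A common way to obtain such components is one-hot encoding, with continuous variables rescaled to a shared range before training. The sketch below assumes that encoding and a min-max scaling; both helper names and all sample values are invented for illustration and are not taken from the study's data.

```python
import numpy as np

def one_hot(values, categories):
    """Encode a nominal variable as one 0/1 column per category."""
    index = {cat: i for i, cat in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for row, value in enumerate(values):
        out[row, index[value]] = 1.0
    return out

def min_max(values):
    """Rescale a continuous variable to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    return (values - values.min()) / span if span > 0 else np.zeros_like(values)

# Invented example records: variety labels and average temperatures.
variety = ["V1", "V3", "V2", "V1"]
temp_avg = [22.0, 24.0, 26.0, 23.0]

var_columns = one_hot(variety, ["V1", "V2", "V3"])   # shape (4, 3)
temp_column = min_max(temp_avg)                      # shape (4,)
features = np.column_stack([temp_column, var_columns])
```

Each column of `features` would become one component plane of the trained SOM, which is consistent with the varieties and soil orders showing up as separate planes.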
Classification of agro-ecological variables related with productivity (initial analysis) Best-matching units (BMUs) of the component planes: productivity, radiation 1 month before harvest (Ra1BH) and radiation 1 month after seed (Ra1AS).
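A best-matching unit is the map unit whose weight vector lies closest to a given input vector. A minimal lookup can be sketched as follows, assuming a `(rows, cols, n_features)` codebook and Euclidean distance; the function name and the toy codebook are illustrative only.

```python
import numpy as np

def best_matching_unit(codebook, x):
    """Return the (row, col) of the unit whose weights are closest to x."""
    rows, cols, _ = codebook.shape
    d2 = ((codebook - np.asarray(x, dtype=float)) ** 2).sum(axis=2)
    return tuple(int(i) for i in np.unravel_index(np.argmin(d2), (rows, cols)))

# Toy codebook: every unit holds [0, 0] except unit (2, 1), which holds [1, 1].
codebook = np.zeros((3, 3, 2))
codebook[2, 1] = [1.0, 1.0]
position = best_matching_unit(codebook, [0.9, 1.1])
```

When the inputs are flattened component planes, planes that land on the same or neighbouring BMUs are the candidates for a shared relationship.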
Conclusions <ul><li>Visualization of agroecological variables is very important but difficult because of the high dimensionality of the data. The SOM algorithm is a powerful technique for dealing with this problem, although it serves as an exploratory analysis tool. </li></ul><ul><li>This study presents a methodology to enhance the analysis of component planes. The methodology improves correlation hunting in the component planes by using a tree-structured representation of clusters based on the SOM distance matrix. </li></ul><ul><li>By analyzing the resulting groups of agro-ecological variables and cultivated zones, it was possible, as an example of applying the methodology, to find a relationship between the radiation after seed, the radiation before harvest, and a high-medium productivity. </li></ul><ul><li>We are currently working to develop data mining and visualization techniques, based on this methodology, to improve decision support in the sugar cane culture. </li></ul>
The end <ul><li>Thanks for new ideas and directions to explore! </li></ul>