How to Evaluate Relatedness Between Categorical Variables Using the Seaborn Library

Correlations are simple to evaluate between numeric variables using scatterplots, but how about categorical variables?

Scatterplots are great visualisation tools to assess relationships and associations between numeric or continuous variables. However, using data points to evaluate categorical variables may not be as straightforward.

Consider a common scenario: a researcher has microarray data (containing ~20,000 transcripts) and wants to find out whether experimental condition A elicits the same gene expression profile as condition B. Plotting strip plots or box plots to visualise all the gene expression differences and trends would be challenging because of the large number of data points.

As discussed previously, clustergrams or heatmaps are an alternative way to visualise gene expression differences. However, these charts do not provide statistics to measure whether the trends in gene expression differences are similar or different.

To get around these limitations, correlation matrices and pair plots can be used, both of which can be plotted with the Seaborn library. If working from raw values, you will need to normalise your data by calculating log2–transformed fold-change (log2FC) values with respect to control/placebo for static data. If data is temporal, then the log2FC can be calculated with respect to time = 0 (baseline).
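As a rough sketch of that normalisation step, assuming a table of raw expression values with one row per gene: the gene names, column names and values below are made up for illustration and are not taken from the vaccine dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw expression values (one row per gene); not real data.
raw = pd.DataFrame(
    {
        "t0": [100.0, 50.0, 200.0],    # baseline (time = 0)
        "6h": [200.0, 50.0, 100.0],
        "day1": [400.0, 25.0, 200.0],
    },
    index=pd.Index(["geneA", "geneB", "geneC"], name="gene"),
)

# log2FC = log2(value / baseline). For temporal data the baseline is time = 0;
# for static data, divide by the control/placebo column instead.
log2fc = np.log2(raw.drop(columns="t0").div(raw["t0"], axis=0))
print(log2fc)
```

A log2FC of 1 means the value doubled relative to baseline, 0 means no change, and -1 means it halved.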

To plot the correlation matrix and pair plots using Python, we first load the required packages. In this blog entry, we will be using the Seaborn and matplotlib libraries:
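A typical import block for this kind of analysis might look like the following. The text only names Seaborn and matplotlib; pandas and NumPy are assumed here because a dataframe and a mask are used later on.

```python
import numpy as np               # assumed: builds the mask for the correlation matrix
import pandas as pd              # assumed: loads and holds the dataframe
import matplotlib.pyplot as plt  # figure size and display
import seaborn as sns            # heatmap and pair plot
```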

In this specific blog entry, we will analyse the correlation (or relatedness) between the different time points after Merck Ad5/HIV vaccination.

We will load and inspect the processed dataframe from GitHub. It is important to label the gene column as the index column for reference. The commands are as follows:
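A minimal sketch of the loading step is shown below. The GitHub location of the file is not reproduced here, so an in-memory CSV with made-up values stands in for it, and the column names (`fc_6hrs` and so on) are hypothetical.

```python
import io

import pandas as pd

# Stand-in for the CSV hosted on GitHub. In practice, pass the raw file URL
# to pd.read_csv instead of this in-memory text. All values are made up.
csv_text = """gene,fc_6hrs,fc_1day,fc_3day,fc_7day
geneA,0.5,1.2,1.0,0.4
geneB,-0.3,-0.8,-0.6,-0.2
geneC,0.1,0.0,0.2,0.1
"""

# index_col labels the gene column as the index for reference.
df = pd.read_csv(io.StringIO(csv_text), index_col="gene")
print(df.head())
```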

The output shows the p-value (pval), adjusted p-value (qval), ratio and fold change (fc) for the 6-hour, 1-day, 3-day and 7-day time points compared to baseline (timepoint = 0).

To tabulate the correlation coefficient between the different time-points, the code is as follows:
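One way to do this, assuming the fold-change columns have been collected into their own dataframe, is pandas' `DataFrame.corr`, which computes pairwise Pearson correlations by default. The data below is randomly generated and the column names are hypothetical, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the fold-change columns; not the real data.
rng = np.random.default_rng(0)
shared = rng.normal(size=500)  # common signal so the columns correlate
cols = ["fc_6hrs", "fc_1day", "fc_3day", "fc_7day"]  # hypothetical names
df_fc = pd.DataFrame({c: shared + rng.normal(size=500) for c in cols})

# Pairwise Pearson correlation coefficients between time points.
corr = df_fc.corr()
print(corr.round(2))
```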

The output showing the correlation coefficients is:

The data suggest that the gene signatures at day 1 are most similar to those at day 3. Interestingly, the signatures at 6 hours post-vaccination are also similar to those at day 7.

Next, we can evaluate the p-value of each correlation to test its significance. We import SciPy and execute the commands as follows:
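A possible version of this step uses `scipy.stats.pearsonr`, which returns the correlation coefficient and a two-sided p-value for a pair of columns. The loop over column pairs, the synthetic data and the column names are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the fold-change columns; not the real data.
rng = np.random.default_rng(0)
shared = rng.normal(size=500)
cols = ["fc_6hrs", "fc_1day", "fc_3day", "fc_7day"]  # hypothetical names
df_fc = pd.DataFrame({c: shared + rng.normal(size=500) for c in cols})

# Pearson r and two-sided p-value for every pair of time points.
pvals = {}
for a, b in combinations(df_fc.columns, 2):
    r, p = stats.pearsonr(df_fc[a], df_fc[b])
    pvals[(a, b)] = p
    print(f"{a} vs {b}: r = {r:.3f}, p = {p:.3g}")
```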

The output is as follows:

Note that all the correlations are significant, probably because of the large number of data points included in the analysis: with thousands of genes, even a weak correlation yields a very small p-value.

To visualise these correlation coefficients in a correlation matrix, we can use the following commands:
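A sketch of such a plot with `sns.heatmap` is given below. The data is synthetic, the column names are hypothetical, and the figure size, the `bwr` colour map and the line width are illustrative choices rather than the exact values used for the original figure.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the fold-change columns; not the real data.
rng = np.random.default_rng(0)
shared = rng.normal(size=300)
cols = ["fc_6hrs", "fc_1day", "fc_3day", "fc_7day"]  # hypothetical names
df_fc = pd.DataFrame({c: shared + rng.normal(size=300) for c in cols})

corr = df_fc.corr()

# True cells are hidden: the upper triangle and the diagonal
# (correlations of a variable with itself).
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(8, 6))  # figure size (illustrative)
sns.heatmap(
    corr,
    mask=mask,
    cmap="bwr",      # blue = negative, white = 0, red = positive
    center=0,        # centre the colour scale at 0
    annot=True,      # print the coefficient in each visible square
    linewidths=0.5,  # thin lines separating the squares
    ax=ax,
)
# Call plt.show() or fig.savefig(...) to display or save the figure.
```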

I will briefly describe the commands above. The commands add a mask on the top half of the correlation matrix and correlations between the same variables so that users can concentrate on the comparisons on the lower half of the plot. I have also defined the figure size, the colour map used (range from blue to red, where blue is negative correlation and red is a positive correlation), and centred the correlation values at 0 (white). The annotations (annot) provide the correlation coefficient values in the graphs and the lines (linewidth) allow us to separate the squares more nicely.

The output file for the correlation matrix is displayed below:

To allow us to see the points that make up the correlation matrix, we can use the commands as follows to plot a pair plot:

Note that the lower half of the pair plots will contain the regression plot for us to visualise the trend and slopes more clearly. This is particularly important in this case as there are a large number of data points. Output file is as follows:

For easy reference, the full set of code is as follows:

And there you have it. Thanks for reading.
