Python Lesson 7: Pandas



1. Lesson outline

  1. Introduction to pandas
  2. Working with pandas objects
  3. Loading and saving data in pandas
  4. Dataframe and Series manipulation
  5. Graphics in pandas
  6. Basic statistical modeling
  7. Working with time series

2. Introduction to pandas

The pandas library

Library designed to easily manipulate, clean, and analyze data sets of very different
nature. Many idioms are taken from NumPy, and the main difference is that
pandas allows for handling heterogeneous tabular data, while NumPy only
accepts homogeneous array data sets.

To import the pandas library (we also import NumPy, which is used in later examples):

import numpy as np
import pandas as pd

We now explain the two most important pandas data structures: series and
dataframes.

2.1. Pandas Series

A pandas series is a 1D array-like object that contains a sequence of values
(homogeneous or not) and an associated array of data labels, its index. You can
define a series from any array; by default the indexes are the integers from
zero to the number of elements minus one.

series_ex_I = pd.Series([4,5,6,7,9])
print(series_ex_I)
series_ex_II = pd.Series([4,5,76.7,True])
print(series_ex_II)

You can access the indexes and the values separately through the index and
values attributes

print(series_ex_II.index)
print(series_ex_II.values)

You can provide the index labels (not necessarily integers) when creating the
data structure

series_ex_III = pd.Series([4,5,76.7,True], index=["t1", "t2", "t3", "tbool1"])
print(series_ex_III)

You can access an element by its index value, or provide a list of indexes and
access a subset of values.

# Accessing a single element
print(series_ex_III['t1'])
# Accessing several elements
print(series_ex_III[["t2","tbool1"]])

You can also perform NumPy-like operations on series and select elements in
accordance with a given criterion

print(series_ex_I[series_ex_I > 5])
print(series_ex_I*4)
print(series_ex_III[series_ex_III > 5])
print(series_ex_III*4)

You can check the existence of a given index label with a syntax similar to the
one used in dictionaries

"t2" in series_ex_III

You can also create a series directly from a dictionary

hash_0 = {"Sevilla": 2022, "Huelva": 2044, "Granada": 2033}
series_ex_IV = pd.Series(hash_0)
print(series_ex_IV.index)
print(series_ex_IV.values)

You can also explicitly control the index label ordering

series_ex_IV = pd.Series(hash_0, index = ["Huelva", "Granada", "Sevilla", "Ceuta"])
print(series_ex_IV)

As the Ceuta index does not exist in the dictionary, it is created with a NaN
value. You can check for the occurrence of these missing values using the
isnull function or method

print(pd.isnull(series_ex_IV))
print(series_ex_IV.isnull())

A very useful feature is that when you operate with pandas series, they are aligned by index label.

hash_1 = {"Sevilla": 2, "Huelva": 4, "Granada": 3}
series_ex_V = pd.Series(hash_1, index = ["Ceuta","Granada", "Huelva", "Sevilla"])
# Index alignment
print(series_ex_IV + series_ex_V)

You can also name the series and its index to facilitate its identification

# Series and index names
series_ex_IV.name = "R&R Parameter"
series_ex_IV.index.name = "City"
#
print(series_ex_IV)

You can also alter the index labels of a series in place

series_ex_III.index = ["Ceuta","Granada", "Huelva", "Sevilla"]

And you can operate with the new series

series_ex_IV + 2*series_ex_V + series_ex_III

2.2. Pandas Dataframes

The best-known pandas data structure is the dataframe, designed to mimic the
versatility of the equally named data structure in GNU R. A dataframe can be
considered a set of series sharing the same index. Therefore it has an index
for the rows and a label for each column. Different columns can hold data of
different dtype, and even the data within a single column can be of mixed type.

We will see that there are several different ways of constructing a
dataframe. A usual one is to start with a dict of equal-length lists or NumPy
arrays.

data_student = {"Wall, L": [8,9,9,10,8], "Page, L": [9,7,9,10,10], "Gates, B": [8, 8, 8, 9, 9], "Thompson, K": [10,10,9,9,9], "Rosum, G van":[9,9,8,8,10]}
dframe_1 = pd.DataFrame(data_student)
dframe_1

As can be checked in the output, the row index is a default one, as in the
Series case, and the columns keep the dictionary order. You can sort them at
your convenience using the columns argument.

dframe_1 = pd.DataFrame(data_student, columns = sorted(data_student.keys()))
dframe_1

In fact, if there are column names without associated data in the dictionary,
the corresponding columns are created with missing values (NaNs). And, as with
series, you can also provide a given index.

cols = list(data_student.keys())
cols.append("Ritchie, D")
#
dframe_2 = pd.DataFrame(
    data_student, columns = sorted(cols), index = ["Math_01", "Math_02", "Alg_01", "OpSys", "Num_An"]
		       )
dframe_2

You can also set the name attribute for the dataframe columns and index
and they will be displayed with the dataframe

dframe_2.index.name = "Subject Grades"
dframe_2.columns.name = "Pro Programmers"
dframe_2

You can retrieve any dataframe row or column. Columns are obtained as a series
using the column name or by attribute access, though the second option is only
valid for column names that are also valid Python variable names.

dframe_2["Wall, L"]
# Not a valid variable name
# print(dframe_2."Wall, L")

Note that the series and index names are conserved. Rows can be retrieved using
the loc attribute.

dframe_1.loc[1]
dframe_2.loc["OpSys"]

The values that you retrieve are not a copy of the underlying data, but a view
of them. Be aware that if you modify them in place, this will affect the
original dataframe. There is a copy method to obtain a copy of the data.
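A quick illustrative sketch (the dataframe here is made up for the example; whether a plain selection behaves as a view can depend on the pandas version, so taking an explicit copy is the safe route when you want independent data):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Taking an explicit copy decouples the data from the dataframe
col = df["a"].copy()
col.iloc[0] = 99

print(df["a"].iloc[0])  # the original dataframe is unchanged: 1
print(col.iloc[0])      # only the copy was modified: 99
```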

You can modify an existing column or create a new column with a default value by assignment

dframe_2["Ritchie, D"] = 9
dframe_2["Torvalds, L"] = 8
dframe_2

You can also provide as a value an array with the right number of elements

dframe_2["Torvalds, L"] = np.arange(6,11)
dframe_2

If instead of an array you provide a series, there is more flexibility, as the
indexes in the series will be aligned with the indexes in the dataframe and
any missing index will be filled with a NaN missing value.

series_Stallman = pd.Series([9,9,9,9], index = ["Math_01", "Math_02", "OpSys", "Num_An"])
#
dframe_2["Stallman, R"] = series_Stallman
#
dframe_2

You can add a boolean column using the NumPy syntax

dframe_2["test"] = dframe_2["Stallman, R"] >= 9
dframe_2

You can delete the added column using the del keyword

del(dframe_2["test"])
dframe_2

You can use a syntax similar to the one used in NumPy arrays to transpose a
dataframe and exchange the indexes and columns roles.

dframe_2.T

2.3. Dataframe creation methods

There are many other ways of creating a dataframe, apart from the previous one,
consisting of formatting your data as a dictionary of lists or NumPy vectors.
We provide some of them here:

  • From a nested dict of dicts. The advantage in this case is that the outer
    dictionary keys become the dataframe columns and the inner keys the row
    indexes

    # Dict of dicts
    d3_dict = {"Planck, M":{ "Hometown": "Kiel", "Born": 1858, "Died": 1947}, "Heisenberg, W":{ "Hometown": "Wurzburg", "Born": 1901, "Died": 1976}, "Fermi, E": { "Hometown": "Roma", "Born": 1901, "Died": 1954}, "Schroedinger, E": { "Hometown": "Erdberg", "Born": 1887, "Died": 1961}}
    dframe_3 = pd.DataFrame(d3_dict)
    dframe_3
    
  • You can also use a dict of series to build the dataframe.

    # dict of series
    d4_dict = {"Page, L": dframe_2["Page, L"][:-1], "Torvalds, L": dframe_2["Torvalds, L"][1:] }
    dframe_4 = pd.DataFrame(d4_dict)
    dframe_4
    
  • Two dimensional ndarray

    # 2D ndarray
    arr_2D = np.ones([4,4])
    dframe_5 = pd.DataFrame(arr_2D, columns=["Col_00", "Col_02","Col_03","Col_04"], index=["a", "b", "c", "d"])
    dframe_5
    

3. Index objects

The array, tuple, list, or any other structure passed to the dataframe
constructor as the index for the data set is transformed into an index
object.

index = dframe_5.index
index

Those objects are immutable. You can create an index object with pd.Index and
pass it to a data structure; these objects can also be shared among data
structures.

index_obj = pd.Index(np.arange(0,5))
print(index_obj)
# Create a series with the index object
series_ex_VI = pd.Series(np.pi*np.arange(0,5), index = index_obj)
print(series_ex_VI)
# Replace the index in a dataframe with the new index object
dframe_6 = dframe_4.set_index(index_obj)
dframe_6
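The immutability of index objects can be checked directly; this minimal sketch (with a made-up index) shows that item assignment raises a TypeError:

```python
import pandas as pd

idx = pd.Index([0, 1, 2, 3])
try:
    idx[0] = 10          # item assignment is not allowed on an Index
except TypeError as err:
    print("TypeError:", err)
```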

An index can contain duplicate labels; a selection will then return all
occurrences of the repeated label

#
dframe_7 = pd.DataFrame(
    data_student, columns = sorted(cols), index = ["Math_01", "Math_02", "Alg_01", "OpSys", "Math_02"]
		       )
print(dframe_7)
dframe_7.loc["Math_02"]

4. Working with pandas objects

4.1. Reindexing series and dataframes

Reindexing a pandas object creates a new object with different index labels.
With a series, it rearranges the data according to the new index: any old
index value missing from the new index is removed from the series, and NaN
missing values are introduced for nonexistent index values.

print(series_ex_I)
new_series = series_ex_I.reindex(np.arange(1,7))
new_series

When applied to a dataframe, reindex can modify rows, columns, or both. By
default, rows are reindexed according to a given sequence in the same way as
for series.

print(dframe_5)
dframe_8 = dframe_5.reindex(["a", "c", "b", "f", "e", "d"])
dframe_8

The reindex function has several arguments. For example, the fill_value
argument allows for defining a default value for the data associated with
non-existent labels when reindexing

dframe_8 = dframe_5.reindex(["a", "c", "b", "f", "e", "d"], fill_value=0)
dframe_8

There is also an option to fill missing values when reindexing, which is quite
useful when working with time series. The option name is method, and the value
ffill (bfill) performs a forward (backward) filling.

dframe_9 = pd.DataFrame(np.random.randn(4,4), columns=["Col_00", "Col_02","Col_03","Col_04"], index=np.arange(0,8,2))
print(dframe_9)
dframe_10 = dframe_9.reindex(np.arange(0,8), method="ffill")
dframe_10

You can also reindex columns using the columns keyword

dframe_11 = dframe_5.reindex(columns=["Col_04", "Col_03", "Col_02", "Col_01", "Col_00"])
dframe_11

Notice the difference between reindexing and renaming columns, either creating
a new object or editing the dataframe in place.

print(dframe_11)
print(dframe_11.rename({"Col_04":"Col_a", "Col_02":"Col_b", "Col_00":"Col_d"}, axis = 1))
print(dframe_11)
dframe_11.rename({"Col_04":"Col_a", "Col_02":"Col_b", "Col_00":"Col_d"}, axis = 1, inplace=True)
print(dframe_11)

4.2. Deleting entries from series and dataframes

You can use the drop method that returns a new object with the indicated
entries deleted. This can be applied to series and dataframes.

print(series_ex_VI)
new_series = series_ex_VI.drop([1,3])
print(new_series)
print(series_ex_VI)
new_dfram = dframe_10.drop(range(2,6))
print(new_dfram)
new_dfram = dframe_10.drop(["Col_03", "Col_00"], axis=1)
print(new_dfram)

The inplace=True option allows for editing the object directly, avoiding the
creation of a new one.
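A minimal sketch of drop with inplace=True (the series here is made up for the example):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

# With inplace=True the object itself is edited and the call returns None
s.drop(["b", "d"], inplace=True)
print(s.index.tolist())  # ['a', 'c']
```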

4.3. Series indexing, selection, and filtering

You can use a syntax similar to the one used by NumPy with pandas series; in
particular, you can use either the index values or integers.

print(series_ex_V)
print(series_ex_V["Granada"])
print(series_ex_V[1])
#
print(series_ex_V[["Granada", "Sevilla"]])

You can apply filters to the series

print(series_ex_V[series_ex_V > 2])

You can also use slicing with integers or index labels, but notice the
difference: the end point of the range is included in the labels case.

print(series_ex_V["Granada":"Sevilla"])
print(series_ex_V[1:3])

You can assign values using these methods

series_ex_V["Granada":"Sevilla"] = 10
print(series_ex_V)

4.4. Dataframes indexing, selection, and filtering

By default, indexing refers to columns and you can set values

print(dframe_7)
print(dframe_7["Thompson, K"])
print(dframe_7[["Gates, B","Thompson, K"]])
dframe_7["Ritchie, D"] = 6

If you use slicing, this works over indexes

print(dframe_7[1:3])

You can select data using a Boolean array

print(dframe_7[dframe_7>8])
print(dframe_7[dframe_7["Page, L"]>8])

You can assign also using this syntax

dframe_7[dframe_7 <= 8] = 5

You can also use loc and iloc for row indexing using axis labels or
integers, respectively.

We can select two rows and two columns by label and by integer values as follows

dframe_7.loc[["Math_01","OpSys"], ["Rosum, G van", "Wall, L"]]
dframe_7.iloc[[0,3], [3,5]]

Both ways of selecting elements work with slices

dframe_7.loc["OpSys":, "Rosum, G van":"Wall, L"]
dframe_7.iloc[0:3, 0:4]

It is important to take into account that if you have an axis index containing
integers, data selection will always be label-oriented to avoid possible
ambiguities. In these cases it is preferable to use loc or iloc.
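A small sketch of the ambiguity (with a made-up series whose integer labels do not match positions):

```python
import pandas as pd
import numpy as np

# Integer labels in reversed order, so labels and positions disagree
s = pd.Series(np.arange(4.0), index=[3, 2, 1, 0])

print(s.loc[3])   # label-based selection: the element labeled 3 -> 0.0
print(s.iloc[3])  # position-based selection: the fourth element -> 3.0
```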

4.5. Arithmetic with series and dataframes

The arithmetic between series with different indexes is performed in such a way
that the final set of indexes is the union of the involved index sets; missing
values are introduced in the elements where the two series don't overlap and
propagate through arithmetic operations.

An example with two series is

print(series_ex_I)
print(series_ex_II)
series_ex_I + series_ex_II

In the case of dataframes a double alignment is performed, in indexes and
columns

print(dframe_4)
print(dframe_7)
dframe_4 + dframe_7

If there are neither columns nor rows in common, the resulting dataframe will
only be made of NaNs. Using the add method you can pass an argument to fill
the non-overlapping elements of the dataframe with a given value, though if
both elements are missing the result will still be a NaN.

dframe_4.add(dframe_7, fill_value = 0.0)

The following methods, which also have reversed versions, are available for
series and dataframe arithmetic.

  1. add (radd): addition
  2. sub (rsub): subtraction
  3. div (rdiv): division
  4. floordiv (rfloordiv): floor division
  5. mul (rmul): multiplication
  6. pow (rpow): exponentiation
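The reversed versions simply swap the operands; a small sketch with div and rdiv (made-up data):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0])

print(s.div(2).tolist())   # s / 2 -> [0.5, 1.0, 2.0]
print(s.rdiv(2).tolist())  # 2 / s -> [2.0, 1.0, 0.5]
```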

You can mix series and dataframes in operations that are performed in a similar
way to broadcasting in NumPy. By default the series index is matched against
the dataframe’s columns, broadcasting down the rows

series_ex_VII = pd.Series(np.random.randn(5), index =["Col_00", "Col_01", "Col_02","Col_03","Col_04"] )
print(series_ex_VII)
print(dframe_8)
#
dframe_8 * series_ex_VII

You can also broadcast on the rows, matching the series index versus the
dataframe index. In this case you need to use the method notation.

dframe_2.rsub(series_Stallman, axis = 'index')

You can also perform arithmetic operations with dataframes

print(dframe_10)
print(dframe_9)
dframe_10/dframe_9

4.6. Mapping functions

NumPy universal functions (ufuncs) can be used in series and dataframes

np.cos(dframe_10)

The apply method allows for applying a function to each row or column of a
dataframe, giving as a result a dataframe or a series with the corresponding
indexes, depending on the function output. By default, the function is applied
to each column.

# series output
g = lambda x: np.mean(x)/np.std(x)
print(dframe_10.apply(g))
##
# dataframe output
f = lambda x: (x - np.mean(x))/np.std(x)
#
dframe_12 = dframe_10.apply(f)
print(dframe_12)
#
print(dframe_12.mean(axis = 0))
#
dframe_13 = dframe_10.apply(f, axis="columns")
print(dframe_13)
#
dframe_13.mean(axis = 1)

You can apply a function that returns multiple values, as a series

def h(x):
    return pd.Series([x.min(), x.max(), x.mean(), x.std(), x.sum()], index=['min', 'max', 'mean', 'std', 'sum'])
print(dframe_10)
dframe_13 = dframe_10.apply(h)
print(dframe_13)

You can also apply element-wise functions to a Series with map and to a
DataFrame with applymap as follows

format_floats = lambda x: '%.2f' % x
# Series
print(dframe_10["Col_00"].map(format_floats))
# Dataframe
dframe_10.applymap(format_floats)

4.7. Sorting

The sort_index method returns a new object, lexicographically sorted by row or column, and can be applied to
Series,

# Sort series by index
print(series_ex_IV)
#
series_ex_IV.sort_index()

and DataFrames. In this case you can also sort by column name.

# Sort dataframe by index
print(dframe_7)

print(dframe_7.sort_index())

print(dframe_7.sort_index(axis=1, ascending=False))

If instead of sorting by indexes you need to sort by Series or DataFrame
values, the method is called sort_values; notice that NaN values are sent to
the end of the series. In the case of DataFrames you can use one or more
columns as sorting keys. In this case it is mandatory to include at least one
column name in the by= argument.

# Sort Series by values 
print(series_ex_IV.sort_values())
# Sort DataFrame by values 
print(dframe_7.sort_values(by="Wall, L"))
print(dframe_7.sort_values(by=["Wall, L", "Thompson, K"]))

Related to sorting is ordering elements by their rank. This is accomplished
with the rank method. At first sight it can be surprising to find non-integer
positions; this is explained by the fact that the method breaks ties by
assigning the mean rank to the group

print(series_ex_IV)
print(series_ex_IV.rank())
print(dframe_7)
print(dframe_7.rank())
print(dframe_7.rank(axis="columns"))

There are other ways of breaking the ties: the options min and max assign the
minimum or maximum rank to the whole group; first takes into account the order
of appearance of the elements; and dense is similar to min, but with ranks
always increasing by one, independently of the number of elements in the group.

print(dframe_7)
print(dframe_7.rank(method="min"))
print(dframe_7.rank(method="dense"))
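The method="first" option can be sketched with a small made-up series; ties keep their order of appearance instead of sharing the mean rank:

```python
import pandas as pd

s = pd.Series([7, 7, 5, 7])

print(s.rank().tolist())                # mean rank for the ties: [3.0, 3.0, 1.0, 3.0]
print(s.rank(method="first").tolist())  # order of appearance: [2.0, 3.0, 1.0, 4.0]
```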

4.8. Descriptive Statistics with Pandas

Pandas provides a set of useful methods to compute descriptive statistical
quantities of your Series or DataFrame.

  1. count: Number of non-NaNs.
  2. describe: Provides summary statistics for a Series or for each DataFrame column.
  3. min, max: Minimum and maximum values
  4. argmin, argmax: Index locations (integers) at which minimum or maximum
    value is obtained.
  5. idxmin, idxmax: Index labels at which minimum or maximum value is obtained.
  6. quantile: Sample quantile ranging from 0 to 1.
  7. sum: Sum of values.
  8. mean: Mean of values.
  9. median: Arithmetic median (50% quantile) of values.
  10. mad: Mean absolute deviation from mean value.
  11. prod: Product of all values.
  12. var: Variance of values.
  13. std: Standard deviation of values.
  14. skew: Skewness of values (third moment).
  15. kurt: Excess Kurtosis of values (fourth moment – 3).
  16. cumsum: Cumulative sum of values.
  17. cummin, cummax: Cumulative minimum or maximum of values, respectively.
  18. cumprod: Cumulative product of values.
  19. diff: Compute first arithmetic difference (useful for time series).
  20. pct_change: Compute percent changes.

Some examples are provided for the normal distribution

# Random Gaussian Dataframe
Elements = 1000
arr_2D = np.random.randn(Elements,6)
cols = []
for index in range(6):
    cols.append("Set_0{0}".format(index))
dframe_14 = pd.DataFrame(arr_2D, columns=cols)
dframe_14

print("Mean\n", dframe_14.mean(), "\n Std.\n", dframe_14.std(), "\nQuantile\n", dframe_14.quantile(q = 0.68), "\nSkewness\n", dframe_14.skew(), "\nKurtosis\n", dframe_14.kurt())

We can also compute more elaborate statistics. As an example data set we can
load the file meteodat.csv, an excerpt of data with a 10-minute frequency from
an automated meteorological station located in the Experimental Sciences
building of the University of Huelva. The provided data span two months (Jan
and Feb 2014) and are comma separated, so you can read them using pd.read_csv.
The first column is the date and the second the time. The rest of the columns
are

Tout
Temperature
Tmax
Max temperature since last measurement
Tmin
Min temperature since last measurement
H out
Relative humidity
Dew point
Dew point temperature
Wind Speed
Wind speed
Wind Direction
Wind direction (degrees)
Wind Speed Hi
Maximum speed wind gust speed
Wind Direction Hi
Maximum speed wind gust direction
Wind Chill
Apparent temperature
Heat Index
Heat Index
THM Index
Heat Index Threshold for heat advisories
Pressure
Atmospheric pressure
Rain
Collected liquid precipitation
Rain Max
Maximum rainfall rate since last measurement
Tin
Temperature inside building
H in
Relative humidity inside building
Dew Point In
Dew point temperature inside building
Heat Index In
Heat index inside building

# Reading data
data_meteo = pd.read_csv('files/meteodat.csv')
#
print(data_meteo.columns)
# Selecting a data subset
stat_data =  data_meteo[["Tout", "H out", "Dew point", "Pressure", "Wind Speed"]]

Now we can compute some statistics. For example the covariance and correlation matrices

print(stat_data.cov())
print(stat_data.corr())

We can also compute the percent change for a given period

percent_change = stat_data.pct_change(periods=6)
percent_change.tail()

We can compute the correlation with other series of data using corrwith

stat_data.corrwith(data_meteo["Tin"])

Other methods of interest related to Series and DataFrame description are
unique, which provides an array of unique values, and value_counts, which
computes value frequencies

# Selecting unique values
uniques_WD = data_meteo["Wind Direction"].unique()
uniques_WD.sort()
print(uniques_WD)
# 
# Counting number of occurrences
values_WD = data_meteo["Wind Direction"].value_counts()
print(values_WD)

The isin method performs a vectorized check of membership in a given set and
is used to filter a Series or a DataFrame column down to a given subset of
values. We select in this case the occurrences of elements with winds in the
"SE", "NNE", or "ESE" directions and wind speed greater than 2 m/s.

# Creating a mask to select some data
mask_WD = (data_meteo["Wind Direction"].isin(["SE","NNE","ESE"])) & (data_meteo["Wind Speed"]>2)

# Using the mask to filter the data
data_meteo[["date", "time", "Wind Speed"]][mask_WD]

Using apply one can perform rather complex data manipulation in a concise way.
We can, for example, calculate for each 10 min interval the difference between
Tmax and Tmin, count the number of occurrences of each value, and sort
according to the temperature difference value instead of the number of cases.
Notice that it is necessary to round to one decimal digit to obtain the
desired value (Do you know why?).

data_meteo.apply(lambda row: round(row["Tmax"] - row["Tmin"],1), axis = 1).value_counts().sort_index()

5. Loading and saving data in pandas

Pandas offers an impressive set of methods for reading/saving data, making it
possible to grapple with many different formats. We start with text-formatted
files and include other formats later.

5.1. Text-formatted files

The main functions in Pandas to read text-formatted files are

read_csv
Load delimited data from a file, URL, or file-like object. The
default delimiter is the comma.
read_table
Load delimited data from a file, URL, or file-like object. The
default delimiter is the tab (‘\t’).
read_fwf
Read data without delimiters in a fixed-width column format.
read_clipboard
Version of read_table that reads data from the clipboard. Useful for
converting tables from web pages.

We have already seen an example of read_csv in action loading data from a
URL. The large variety of formats and ways of encoding information into text
files can be handled at the cost of having a plethora of options and modifiers
to the previous functions. The most common ones are

path
String indicating filesystem location, URL, or file-like object.
sep or delimiter
Character sequence or regular expression that marks
field separation in each row.
header
Row number whose entries are used as column names. Defaults to the
first row, and should be None if there is no header row.
index_col
Column numbers or names to use as the row index in the result;
can be a single name/number or a list of them for a hierarchical index.
names
List of column names for the result; combine it with header=None.
skiprows
Number of rows at the beginning of the file to ignore. It can also be given as
a list of row numbers (starting from 0) to skip.
na_values
Sequence of values to replace with NA.
comment
Character(s) marking comments to split comments off the end of lines.
parse_dates
Attempt to parse data to datetime; False by default. If True, will attempt to
parse all columns. Otherwise, a list of column numbers or names to parse can
be specified. If an element of the list is a tuple or list, multiple columns
will be combined and parsed to date (e.g., if date/time are split across two
columns).
keep_date_col
If joining columns to parse date, keep the joined columns; False by default.
converters
Dict mapping column numbers or names to functions (e.g., {'foo': f} would
apply the function f to all values in the 'foo' column).
dayfirst
When parsing potentially ambiguous dates, treat as international
format (e.g., 7/6/2012 -> June 7, 2012); False by default.
date_parser
Function to use to parse dates.
nrows
Number of rows to read from beginning of file.
iterator
Return a TextParser object for reading file piecemeal.
chunksize
For iteration, size of file chunks.
skip_footer
Number of lines to ignore at end of file.
verbose
Print various parser output information, like the number of
missing values placed in non-numeric columns.
encoding
Text encoding for Unicode (e.g., ‘utf-8’ for UTF-8 encoded text).
squeeze
If the parsed data only contains one column, return a Series.
thousands
Separator for thousands (e.g., ‘,’ or ‘.’).

As an example we can read into a dataframe one of the monthly temperature files used in previous
examples

pd.read_csv("files/TData/T_Araxos_EM.csv")

By default, the first row elements are used to label the columns. We can
choose our own labels and also turn the year into the dataframe index

tdata_Araxos = pd.read_csv("files/TData/T_Araxos_EM.csv", index_col=0, names = ["Jan", "Feb", "Mar", "Apr","May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], skiprows=1)
tdata_Araxos

Notice that in this case we are also forced to skip the first row.

In case a file is very long, we may need to read it piecewise or iterate over
small portions of it using the nrows and chunksize arguments.
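A sketch of piecewise reading with the chunksize option (a small file is written on the fly here just for illustration; the filename is made up):

```python
import pandas as pd

# Write a small example file (illustrative only)
pd.DataFrame({"x": range(10)}).to_csv("chunks_demo.csv", index=False)

# Read it back four rows at a time; read_csv returns an iterator of dataframes
n_chunks = 0
total = 0
for chunk in pd.read_csv("chunks_demo.csv", chunksize=4):
    n_chunks += 1
    total += int(chunk["x"].sum())

print(n_chunks, total)  # 3 chunks covering values that sum to 45
```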

As regards data saving, the to_csv method allows for writing data as
comma-separated files. We can write the dataframe read above into a file. By
default the separator used is a comma, but you can define another character as
separator with the sep option

tdata_Araxos.to_csv("temp_Data_Araxos.csv", sep=";")
!cat temp_Data_Araxos.csv

By default, row and column labels are included in the file. This can be
disabled using the index=False and header=False options, respectively. Note
that missing values appear as empty strings and can be denoted by a so-called
sentinel value using the na_rep option.

tdata_Araxos["UndefVals"] = np.nan
tdata_Araxos.to_csv("temp_Data_Araxos.csv", sep=":", na_rep="NULL")
!cat temp_Data_Araxos.csv

You can also save only some of the columns, in an arbitrary order you define

tdata_Araxos.to_csv("temp_Data_Araxos.csv", sep=":", columns=["Mar", "Apr", "May"])
!cat temp_Data_Araxos.csv

5.2. Other file types

read_json, to_json
Read or save data using JSON (JavaScript Object Notation) string representation. Library: json.
read_pickle, to_pickle
Read or save an object in Python pickle format.
read_excel, to_excel
Read or save data using Excel XLS or XLSX file standard. Use packages
xlrd and openpyxl.
read_hdf, to_hdf
Read or save using HDF5 (hierarchical data format) standard.
read_html
Read all tables found in the given HTML or XML document.
read_msgpack
Read pandas data encoded using the MessagePack binary format.
read_sas
Read a SAS dataset stored in one of the SAS system’s custom storage formats.
read_sql
Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame.
read_stata
Read a dataset from Stata file format.
read_feather
Read the Feather binary file format
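As a small sketch of one of these pairs, a JSON round trip (the file name is made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

df.to_json("demo.json")              # save as a JSON string representation
df_back = pd.read_json("demo.json")  # read it back into a dataframe

print((df_back.to_numpy() == df.to_numpy()).all())
```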

6. Dataframe and series manipulation

6.1. Hierarchical indexing

You can define series and dataframes with more than one level of indexing,
paving the way to treating multidimensional data in a lower-dimensional form.
This is an example for a series

series_hind = pd.Series(np.random.randn(10), index=[[0,0,0,0,1,1,1,2,3,4],["x","y","z","t","x","y","z","x","y","z"]])
#
print(series_hind)
print(series_hind.index)

You can access the elements making use of the different index levels

print(series_hind[0])
print(series_hind[1,"z"])

And you can also slice through the indexes making use of the loc attribute and
the slice function or the NumPy syntax.

print(series_hind.loc[slice(1,3)])
print(series_hind.loc[1:3])
print(series_hind.loc[:,"z"])

Note that in this case slices are inclusive.

Hierarchical indexing eases the reshaping of data. Making use of the unstack
method we can transform the previous series into a dataframe, filling
undefined values with NaNs

series_hind.unstack()

The opposite operation is performed (unsurprisingly) by the stack method,
transforming a dataframe into a multi-index series.
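A small sketch of the stack/unstack round trip (made-up dataframe):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=["r1", "r2"], columns=["c1", "c2", "c3"])

stacked = df.stack()        # dataframe -> multi-index series
print(stacked["r1", "c2"])  # element at row r1, column c2
print(stacked.unstack().equals(df))  # unstack recovers the original dataframe
```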

The extension of hierarchical indexes to dataframes implies that both rows and
columns can have multiple indexes. Let’s build an example dataframe with hierarchical
indexes in rows and columns.

frame_hind = pd.DataFrame(np.random.randint(0,100,size=(6,6)), index=[["i","i","ii","ii","iii","iii"],["a","b","a","b","a","b"]], columns=[["CA","CA","CA","HU","HU","HU"], ["S1", "S2", "S3","S1", "S2", "S3"]])
frame_hind

You can name the indexes, helping to make your code more understandable

frame_hind.index.names = ["Key a", "Key b"]
frame_hind.columns.names = ["Prov", "Section"]
frame_hind

It is now easy to select a group of columns

frame_hind["HU"]

You can change the order of the levels in rows and columns and also sort the
data according to some criterion. If, for example, you want to exchange the
level order in both rows and columns, you can use the swaplevel function,
taking into account that it returns a new object

frame_hind_new = frame_hind.swaplevel("Prov", "Section", axis = 1).swaplevel("Key a", "Key b")
frame_hind_new 

The sort_index command sorts the data according to a single index level

frame_hind.sort_index(level=1)

You can combine swaplevel and sort_index functions

frame_hind.swaplevel("Prov", "Section", axis = 1).sort_index(level=1, axis=1)

Many statistics functions have a level option that allows for specifying the
level at which the operation is performed

print(frame_hind.sum(level="Key b"))
print(frame_hind.sum(axis=1,level="Prov"))
print(frame_hind.mean(axis=1, level="Section"))

You can easily create new DataFrames with hierarchical indexing using the
set_index and reset_index functions. The first one takes one or more columns
as the new indexes for a new DataFrame, while the second does the inverse
operation. Depending on the value of the drop argument, the columns will be
removed or kept.

test_ri = frame_hind_new.reset_index()
print(test_ri)
test_si = test_ri.set_index(["Key b", "Key a"], drop=False)
print(test_si)

6.2. Data Handling

Different pandas objects can be combined using the merge and concat functions.

The merge function connects rows in DataFrames based on one or more keys, in a
way familiar to those used to working with relational databases.

frame_1= pd.DataFrame(np.random.randint(0,4,size=(5,4)), columns=["S1", "S2", "S3","S4"])
frame_2= pd.DataFrame(np.random.randint(0,6,size=(6,4)), columns=["K1", "K2", "K3","K4"])
#
print(frame_1)
print(frame_2)
#
pd.merge(frame_1, frame_2, left_on = "S3", right_on = "K3")

By default an inner merge is performed and the intersection of the key sets is
used: the common elements in the indicated columns are selected, together with
the accompanying elements from the other columns. The default behavior can be
changed using the left, right, or outer options

pd.merge(frame_1, frame_2, left_on = "S3", right_on = "K3", how="outer")

You can also use keys from various columns

pd.merge(frame_1, frame_2, left_on = ["S2","S3"], right_on = ["K1","K3"], how="inner")

When overlapping column names appear in the joined dataframe, pandas by default
adds the _x and _y suffixes to disambiguate them. This can be changed using the
suffixes option of merge.
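A minimal sketch of the suffixes option, with two small made-up frames that share a non-key column name:

```python
import pandas as pd

# Two frames sharing both the key column and a non-key column name
left = pd.DataFrame({"key": ["a", "b"], "value": [1, 2]})
right = pd.DataFrame({"key": ["a", "b"], "value": [10, 20]})

merged = pd.merge(left, right, on="key", suffixes=("_left", "_right"))
print(merged.columns.tolist())  # ['key', 'value_left', 'value_right']
```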

The concat function concatenates or stacks together objects along a given
axis, in a similar way to the NumPy concatenate function. Since axes are
labeled in pandas, there are several options when binding two Series or
DataFrames: you can stack either the intersection or the union of the indexes,
and you can either discard the labels on the axes being bound or keep the
concatenated data structures identifiable after merging.

If we use as arguments Series having no common index, the concat function
simply stacks them and creates a new Series

pd.concat([series_ex_I,series_ex_III,series_ex_VII])

By default concat acts on axis=0, creating a new Series; the option
axis=1 turns the output into a DataFrame with each Series as a column,
filling the empty values with NaN (an outer join).

pd.concat([series_ex_I,series_ex_III,series_ex_VII],axis =1)

If there are common indexes and the operation is performed along axis = 0,
there will be repeated index values, while in the axis = 1 case the number of
undefined NaN values is reduced.

series_ex_VIII = pd.concat([series_ex_III,series_ex_VII]) + 1
print(pd.concat([series_ex_I,series_ex_III,series_ex_VIII]))
print(pd.concat([series_ex_I,series_ex_III,series_ex_VIII], axis = 1))

In the axis=1 case, you can perform an inner join using the argument join="inner".

pd.concat([series_ex_III,series_ex_VIII], axis=1, join = "inner")

To keep track of the initial arguments you can use a hierarchical index on the
concatenation axis with the keys argument

pd.concat([series_ex_I,series_ex_III,series_ex_VIII],axis =0, keys=["I", "III", "VIII"])
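The resulting hierarchical index lets you recover each original object through the outer level. A self-contained sketch with two small illustrative series:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([3, 4], index=["a", "c"])

# keys builds a MultiIndex on the concatenation axis
combined = pd.concat([s1, s2], keys=["first", "second"])
print(combined)
print(combined.loc["first"])  # recovers s1 through the outer index level
```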

If axis=1 the keys will be used as column names.

pd.concat([series_ex_I,series_ex_III,series_ex_VIII],axis =1, keys=["I", "III", "VIII"])

You can also concatenate DataFrames. Notice the difference between specifying
axis = 1 or using the default. If you specify a keys argument, it is used to
build a hierarchical index along the concatenation axis.

print(pd.concat([dframe_1,dframe_4]))
#
print(pd.concat([dframe_1,dframe_4], axis=1))
#
print(pd.concat([dframe_1,dframe_4], keys=["DF_1", "DF_4"]))
print(pd.concat([dframe_1,dframe_4], axis=1, keys=["DF_1", "DF_4"]))

You can also specify the key values as the keys of a dictionary that replaces
the list argument.

print(pd.concat({"DF_1": dframe_1,"DF_4": dframe_4}, axis=1))

In case the information on the index is not relevant you can use the
ignore_index=True option

print(pd.concat([dframe_1,dframe_5], axis=0, ignore_index=True))
print(pd.concat([dframe_1,dframe_5], axis=1, ignore_index=True))
Exercise 3

7. Working with time series

Working with time series is a complex subject and mastering it requires time and
dedication. Time series are ubiquitous and can be found in Chemistry, Physics,
Ecology, Economics and Finance, Medicine, and Social Sciences. Pandas provides
tools to work with fixed frequency data as well as with irregular time series.

Native Python provides a way to deal with time data with the datetime data type and the modules datetime,
time, and calendar. For example

from datetime import datetime
now = datetime.now()
print(type(now))
print(now.year, now.month, now.day, now.hour, now.minute)
now

Time is stored down to the microsecond. The time difference between two
different times is represented as a timedelta object

delta = now - datetime(1969,10,9)
delta

The timedelta type also allows for shifting a given time by some amount

from datetime import timedelta
now - timedelta(days=10, hours=3, minutes = 20)

You can format datetime objects using str or -for a given format-
strftime.

print(str(now))
print(now.strftime("%d/%m/%Y %H.%M"))

The possible format specification codes for strftime are

%Y
Four-digit year
%y
Two-digit year
%m
Two-digit month
%d
Two-digit day
%H
Hour (24-hour clock)
%I
Hour (12-hour clock)
%M
Two-digit minute
%S
Second [00, 61] (seconds 60, 61 account for leap seconds)
%w
Weekday as integer [0 (Sunday), 6]
%U
Week number of the year [00, 53]; Sunday is considered the first day of the week, and days before the first Sunday of the year are “week 0”
%W
Week number of the year [00, 53]; Monday is considered the first day of the week, and days before the first Monday of the year are “week 0”
%z
UTC time zone offset as +HHMM or -HHMM; empty if time zone naive
%F
Shortcut for %Y-%m-%d (e.g., 2012-04-18)
%D
Shortcut for %m/%d/%y (e.g., 04/18/12)

The same formats are used to convert strings to dates using the strptime
function

datetime.strptime("2020 12 26 22h11m", "%Y %m %d %Hh%Mm")

To avoid the explicit format definition, the parse function from the
dateutil.parser module, able to translate many different date string formats, can be used

from dateutil.parser import parse
parse("2020-12-10")
parse('Jan 01, 1957 11:45 PM')
parse('12/12/2011', dayfirst=True)

Pandas usually works with arrays of time data, either as an axis index or as
columns. It provides the to_datetime function to parse many different
date representations

pd.to_datetime(["2021-11-26", "2021-11-26 00:12","2021-11-26 22:33"])

It can handle gaps or missing elements in the time series, introducing NaT (Not a Time) values

pd.to_datetime(["2021-11-26", "2021-11-26 00:12","2021-11-26 22:33:10"] + [None])

We will only give some hints on how to work with time series. We first create a dataframe, test_dataframe, with hourly data filled with random values drawn from different distributions as follows

# Create dataframe with random data from different distributions with hourly frequency
num_days = 31
rng = np.random.default_rng()
time = list(map(lambda x: str(x)+":00", range(0,24)))
date = list(map(lambda x: str(x) + "/01/2000",range(1,num_days+1)))
list_times = list(map(lambda x: [x +" "+ time_val for time_val in time], date))
time_date = []
for data_item in list_times:
    time_date += data_item
dict_initial = {"date": time_date, "data_normal": rng.normal(size=num_days*24), "data_beta":rng.exponential(size=num_days*24), "data_uniform":rng.uniform(low = -1, size=num_days*24)}
test_dataframe = pd.DataFrame(dict_initial)
test_dataframe 
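The hourly time axis built above with string manipulation can also be generated much more compactly with the pd.date_range function. A sketch, assuming the same 31 days of January 2000 (the frame and column names here are illustrative):

```python
import numpy as np
import pandas as pd

num_days = 31
rng = np.random.default_rng()

# One timestamp per hour for January 2000, no string manipulation needed
hourly_index = pd.date_range("2000-01-01", periods=num_days * 24, freq="h")

test_dataframe_dr = pd.DataFrame(
    {"data_normal": rng.normal(size=num_days * 24)}, index=hourly_index
)
print(test_dataframe_dr.head())
```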

We create a new column,
named date_time, transforming the date values into timestamps with the to_datetime function.

# Transform the date strings into timestamps
test_dataframe['date_time']=pd.to_datetime(test_dataframe['date'], format="%d/%m/%Y %H:%M") 
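Once a column holds timestamps, its components can be accessed through the .dt accessor. A self-contained sketch with a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"date": ["01/01/2000 00:00", "02/01/2000 13:00"]})
df["date_time"] = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M")

# Component access through the .dt accessor
print(df["date_time"].dt.day)         # 1, 2
print(df["date_time"].dt.hour)        # 0, 13
print(df["date_time"].dt.day_name())  # day-of-week names
```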

A useful feature is to change the time frequency, downsampling or
upsampling data. In this case we will downsample to daily and weekly
frequencies using the resample method. This method first groups the data according
to a given criterion and then calls an aggregation function.

daily_data = test_dataframe.resample("1d", on = "date_time").mean(numeric_only=True) # numeric_only needed to avoid error trying to calculate mean value of "date_time"
print(daily_data)
#
weekly_data = test_dataframe.resample("1W", on = "date_time").mean(numeric_only=True) 
print(weekly_data)

In this case you are applying the same aggregation function to each column, but
you can instead specify different functions for different columns using the
agg method.

test_agg = test_dataframe.resample("1W", on = "date_time").agg({"data_normal":"mean",  "data_beta":"std", "data_uniform":"mean"})
print(test_agg)

You can also apply different functions to a single column as follows

test_agg = test_dataframe.resample("1d", on = "date_time")["data_normal"].agg(("sum","mean","std"))
print(test_agg)
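The resample method can also upsample to a higher frequency; the new time slots are undefined until you fill them, e.g. with ffill. A minimal sketch with made-up daily data:

```python
import pandas as pd

# Three days of daily data
daily = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=pd.date_range("2000-01-01", periods=3, freq="D"),
)

# Upsample to a 6-hour frequency, forward-filling the new slots
upsampled = daily.resample("6h").ffill()
print(upsampled)
```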

8. Graphics in Pandas

We can now depict the data for different frequencies and using different formats

dframe_AND.plot()
dframe_AND.plot.area(subplots=True,figsize=(9,12),sharex=True)
dframe_AND_Monthly.plot.bar(stacked=True)
dframe_AND_Weekly.plot.bar(stacked=True, figsize=(10,7))
dframe_AND_Weekly.plot.bar(subplots=True,figsize=(12,8),sharex=True)
dframe_AND_Monthly.plot.barh(subplots=True,figsize=(7,20),sharex=True)

Apart from the use of Matplotlib methods, Pandas has its own built-in methods to visualize data saved in Series and
DataFrames. On top of this, we may use the Seaborn library, which
modifies default Matplotlib settings and allows for a fast and convenient way
to make complex plots and explore possible statistical relations among data sets. In order to install Seaborn you
only need to run, in the environment where you intend to use it

$ conda install seaborn

An interesting piece of advice: once a plot gets complex enough, sketch your plot by hand -or your idea of
the plot- before you begin coding it. That way you can assess how appropriate
the plot is, as well as clearly mark the objects and relationships that you
want the plot to convey. Be extra careful to provide readable legends, axes, and tick labels.

Very basic plots can be depicted using the built-in plot method. If no column name is specified results for all columns are displayed.

test_dataframe[["data_normal", "data_beta", "data_uniform"]].plot()

The default behavior is to create a line for each column and label it with the
column name. The plot attribute has a series of methods to create different
plot types. The default is line.

You can customize the plot with some or all of the following options.

subplots
Plot each DataFrame column in a different subplot.
sharex
If subplots=True, share the same x axis, linking ticks and limits.
sharey
If subplots=True, share the same y-axis
figsize
Size of figure to create. Tuple with (width, height).
title
Plot title as string.
legend
Add a subplot legend (True by default).
sort_columns
Plot columns in alphabetical order; by default the existing column order is used.

We can modify the previous plot using these options

test_dataframe[["data_normal", "data_beta", "data_uniform"]].plot(subplots=True) # Plotting each column in a subplot

You can also trivially plot histograms. For example

test_dataframe[["data_normal", "data_beta", "data_uniform"]].plot(kind = "hist", bins=51, alpha = 0.75) 

You can also stack the results

test_dataframe[["data_normal", "data_beta", "data_uniform"]].plot(kind = "hist", bins=51, stacked=True)

Also, one of the built-in Pandas options is the representation of Kernel Density
Estimate (KDE) plots, a kind of density plot that provides an estimate of the
probability distribution behind the data.

test_dataframe[["data_normal", "data_beta", "data_uniform"]].plot.kde()

Instead of kde one can use the modifier density. Both are equivalent.

You can also plot data using the Seaborn library. The first step is to select a particular
Seaborn style, if you want to change the default one. Possible options for the style
argument are: darkgrid, whitegrid, dark, white, and ticks.

sns.set(style="whitegrid")

We first build some figures using one of the datasets provided with Seaborn, which are
Pandas dataframes

tips = sns.load_dataset("tips")
tips

This is a Pandas dataframe with data about the bills and tips left by different
parties in a restaurant. The Seaborn relplot command allows for easy and direct
creation of complex relational plots, like the following:

sns.relplot(
    data=tips,
    x="total_bill", y="tip", col="time",
    hue="smoker", style="smoker", size="size",
)

In this case the total_bill and tip columns are plotted as abscissa and
ordinate. Data are split into two panel columns (Lunch and Dinner) defined by the time
column values. Color and marker style are fixed according to the smoker column
values, and the marker size is determined by the party size in the size
column.

A similar concise syntax is used in the Seaborn lmplot function, which
adds to the scatterplot the result and the uncertainty band of a linear regression
fit to the provided data.

sns.lmplot(data=tips, x="total_bill", y="tip", col="time", hue="smoker")

Another interesting function is displot, which combines histograms and kernel
density estimations to approximate the probability distribution of
the variable under study

sns.displot(data=tips, x="total_bill", col="time", kde=True)

You can also depict, in a single step, the empirical cumulative distribution
function of the data using the kind="ecdf" argument

sns.displot(data=tips, kind="ecdf", x="total_bill", col="time", hue="smoker", rug=True)

In the case of categorical data, appropriate representations can be achieved
with the catplot command and representation options like swarm or violin.

sns.catplot(data=tips, kind="swarm", x="day", y="tip", hue="smoker")
Exercise 4

9. Basic Statistical Modeling

We will use the statsmodels library for the statistical treatment of
datasets. You can find plenty of information about this library on the
Statsmodels homepage.

The first step is loading the necessary libraries. A previous installation of the
statsmodels package in the selected conda environment is required.

import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices

We now create a dataframe, pd_data, with data from a cubic polynomial plus Gaussian noise with two different variances (different scale parameters), and we plot the resulting data

# Generate data for cubic fit, with two different values of normal noise
# Abscissa values
npoints = 91
x_values = np.linspace(-5, 5, npoints)
#
# Parameter values a_0 + a_1*x + a_2*x**2 + a_3*x**3 
a_0_par, a_1_par, a_2_par, a_3_par = 12.0, -5.5, 2.5, 0.5
#
# Calculate values
y_values = np.full((npoints), a_0_par)
for index, a_par in enumerate((a_1_par, a_2_par, a_3_par)):
    y_values += a_par*x_values**(index+1)
#
# Build Dataframe and add noise
rng = np.random.default_rng()
pd_data = pd.DataFrame({"X": x_values, 
			"Y_1": y_values + rng.normal(loc = 0, scale = 2.5, size = npoints), 
			"Y_2": y_values + rng.normal(loc = 0, scale = 7.5, size = npoints)})
#
pd_data.plot(x="X", y=["Y_1", "Y_2"], ls='', marker=".")

A simple linear fit can be performed using the OLS statsmodels function

model = sm.OLS(pd_data["Y_1"], sm.add_constant(pd_data["X"]))
result_OLS_1 = model.fit()
result_OLS_1.summary()

Notice that, in order to include a constant term in the fit, it is necessary to
explicitly add a constant column to the abscissa values with the
add_constant function.

The summary provides rich statistical information. The most relevant items are the model R-squared and Adj. R-squared and
the P>|t| values associated with the model parameters. The R-squared value tells you how much of the variation in the ordinate is explained by the abscissa values. The Adj. R-squared value is an adjustment of R-squared based on the number of
observations and the residual degrees of freedom. In the coefficient table the most relevant fields are coef, std err, P>|t|, and [0.025 0.975].

The values under coef are the optimized parameter values, in this case the intercept (const label) and the slope (X label). The std err column provides the standard error of each coefficient.
Finally, P>|t| is the p-value of the t-test for the significance of each parameter (if p < 0.05 the parameter is significant at a 95% confidence level) and the values under [0.025 0.975]
define the 95% confidence interval for each parameter (if this interval does not contain zero, the parameter is significant at a 95% confidence level).
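The quantities shown in the summary can also be retrieved programmatically from the result object (e.g. result_OLS_1.params, result_OLS_1.bse, result_OLS_1.pvalues, and result_OLS_1.conf_int()). As a sanity check of the coef column, the same least-squares solution can be reproduced with plain NumPy; a minimal sketch with synthetic data (the true parameters 2.0 and 3.0 are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-5, 5, 91)
# Straight line y = 2.0 + 3.0*x plus a small Gaussian noise
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=x.size)

# Design matrix with a constant column, as sm.add_constant would build it
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # approximately [2.0, 3.0]
```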

If we intend to perform a linear fit with no intercept using the same syntax, we simply do not add the constant term.

model = sm.OLS(pd_data["Y_1"], pd_data["X"])
result_OLS_no_intercept_1 = model.fit()
result_OLS_no_intercept_1.summary()

A graphical representation of the input data and the linear fit is prepared by defining a new set of abscissa values and applying the fitted model to them with the predict method

X_pred = np.linspace(pd_data["X"].min(), pd_data["X"].max(), 21)
X_pred = sm.add_constant(X_pred) 

#
Y_pred = result_OLS_1.predict(X_pred) # Calculate the predicted values 

#
fig, ax = plt.subplots()
ax.scatter(pd_data["X"], pd_data["Y_1"], alpha=0.3) # Plot the raw data 
ax.plot(X_pred[:,1], Y_pred, 'r', alpha=0.9, label = "Constant term") # Plot the predicted data 
#
ax.legend()
ax.set_xlabel("X") 
ax.set_ylabel("Y")

We may be interested in performing a linear fit using a more complex function. The general procedure, using the formula interface, is as follows

model = smf.ols(formula='Y_1 ~ np.power(X, 3) + np.power(X, 2) + X', data=pd_data) # Cubic model definition
result_cubic_1 = model.fit() # Fit to the model
result_cubic_1.summary() # Display statistical info

In this case the constant term is included by default. You can explicitly remove it as follows

model = smf.ols(formula='Y_1 ~ np.power(X, 3) + np.power(X, 2) + X - 1', data=pd_data) ## Notice the -1 term removing a constant contribution.
result_cubic_1_no_intercept = model.fit()
result_cubic_1_no_intercept.summary()

We can now plot the results

X_pred = np.linspace(pd_data["X"].min(), pd_data["X"].max(), 21)
#
Y_pred = result_cubic_1.predict({"X": X_pred})
Y_pred_no = result_cubic_1_no_intercept.predict({"X": X_pred})
#
fig, ax = plt.subplots()
ax.scatter(pd_data["X"], pd_data["Y_1"], alpha=0.3) # Plot the raw data 
ax.plot(X_pred, Y_pred, 'r', alpha=0.9, label = "Constant term") # Plot the predicted data 
ax.plot(X_pred, Y_pred_no, 'g', alpha=0.9, label = "No constant term") # Plot the predicted data 
#
ax.legend()
ax.set_xlabel("X") 
ax.set_ylabel("Y")

Notice that the R-squared value is larger in the no constant term case, due to an inconsistent way of calculating this parameter when the constant term is absent. A warning informing the user about this is displayed at run time.

The most relevant pieces of information in the summary are the items that follow

Fit parameters
Can be accessed using res.params.
R-squared
The coefficient of determination. A statistical
measure of how well the regression line approximates the real data
points. A perfect fit corresponds to an R-squared equal to one.
Adj. R-squared
The correlation coefficient adjusted based on the number of
observations and the residual degrees of freedom.
P > |t|
P-value for the null hypothesis that the coefficient equals zero. If it is less than the significance level, often 0.05, it indicates that there is a statistically significant relationship between the term and the response.
[95.0% Conf. Interval]
The lower and upper values of the 95% confidence
interval.

10. Working with time series


We will only give some hints on how to work with time series
using the previous example of Covid-19 data in
Spain. We first read the data from the public server and create a new column
named "time", transforming the date into a timestamp with to_datetime

data_CoViD=pd.read_csv('https://cnecovid.isciii.es/covid19/resources/datos_provincias.csv') 
data_CoViD['time']=pd.to_datetime(data_CoViD['fecha']) 

We now select the data for the total number of cases in the eight Andalusian
provinces, using the time data as index. We start with the province of Almería (AL)

# Create a Dataframe with total number of cases
dframe_AND = pd.DataFrame({"AL": data_CoViD[data_CoViD["provincia_iso"]=="AL"]["num_casos"]})
dframe_AND = dframe_AND.set_index(data_CoViD[data_CoViD["provincia_iso"]=="AL"]["time"])

Author: Curro Perez Bernal

Created: 2025-09-04 Thu 14:15
