Python
- High-level, general-purpose programming language
- Object-oriented and imperative programming styles
- Rich standard library and extensive ecosystem of third-party libraries and frameworks
- Used for a wide range of applications, including web development, scientific computing, data analysis, and artificial intelligence
- Syntax is known for its simplicity and readability
- Dynamically typed
- Supports multiple programming paradigms, including imperative, functional, and object-oriented programming
- Popular implementations include CPython, PyPy, and Jython
R
- High-level, specialized programming language for statistical computing and data analysis
- Functional and object-oriented programming styles
- Rich standard library and extensive ecosystem of third-party libraries and packages
- Used primarily for statistical computing and data analysis
- Syntax is known for its expressiveness and power
- Dynamically typed
- Supports functional programming and object-oriented programming paradigms
- The reference implementation is GNU R; RStudio is a popular IDE for it rather than a separate implementation
Overall, Python and R are both popular and powerful languages, but they have some key differences in their design and intended use. Python is a general-purpose programming language that is used for a wide range of applications, while R is a specialized language used primarily for statistical computing and data analysis.
A comparison of R and pandas looks at three things:
- Functionality / flexibility: what can and cannot be done with each tool
- Performance: how fast operations are. Hard numbers and benchmarks are preferable
- Ease of use: whether one tool is easier or harder to use (you may have to be the judge of this, given side-by-side code comparisons)
This page is also here to offer a bit of a translation guide for users of these R packages.
Quick reference
Querying, filtering, sampling
R                                      pandas
dim(df)                                df.shape
head(df)                               df.head()
slice(df, 1:10)                        df.iloc[:10]
filter(df, col1 == 1, col2 == 1)       df.query('col1 == 1 & col2 == 1')
df[df$col1 == 1 & df$col2 == 1,]       df[(df.col1 == 1) & (df.col2 == 1)]
select(df, col1, col2)                 df[['col1', 'col2']]
select(df, col1:col3)                  df.loc[:, 'col1':'col3']
select(df, -(col1:col3))               df.drop(cols_to_drop, axis=1)
distinct(select(df, col1))             df[['col1']].drop_duplicates()
distinct(select(df, col1, col2))       df[['col1', 'col2']].drop_duplicates()
sample_n(df, 10)                       df.sample(n=10)
sample_frac(df, 0.01)                  df.sample(frac=0.01)
R’s shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas if you have the list of columns, for example df[cols[1:3]] or df.drop(columns=cols[1:3]). Note that positions in cols are zero-based and the end of the slice is exclusive. Excluding a range by column name is a bit messier.
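As a minimal sketch of the column-range note above, the snippet below keeps and drops a contiguous block of columns both by position and by label. The frame and its column names (col1 to col5) are placeholders invented for illustration.

```python
import pandas as pd

# Hypothetical frame with five columns named col1..col5 (illustration only).
df = pd.DataFrame({f"col{i}": range(3) for i in range(1, 6)})
cols = list(df.columns)

# Keep col1..col3 by position in the column list (slice end is exclusive).
kept_by_position = df[cols[0:3]]

# Keep col1..col3 by label; label slices with .loc include both endpoints.
kept_by_label = df.loc[:, "col1":"col3"]

# Drop col1..col3, the counterpart of select(df, -(col1:col3)).
dropped = df.drop(columns=cols[0:3])

# Drop a label range without building the list of names by hand.
dropped_by_label = df.drop(columns=df.loc[:, "col1":"col3"].columns)
```

Passing columns= (or axis=1) matters here: without it, drop looks for the given labels in the row index.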
Sorting
R                            pandas
arrange(df, col1, col2)      df.sort_values(['col1', 'col2'])
arrange(df, desc(col1))      df.sort_values('col1', ascending=False)
Transforming
R
pandas
select(df, col_one = col1)
df.rename(columns={'col1': 'col_one'})['col_one']
rename(df, col_one = col1)
df.rename(columns={'col1': 'col_one'})
mutate(df, c=a-b)
df.assign(c=df['a']-df['b'])
Grouping and summarizing
R                                             pandas
summary(df)                                   df.describe()
gdf <- group_by(df, col1)                     gdf = df.groupby('col1')
summarise(gdf, avg=mean(col1, na.rm=TRUE))    df.groupby('col1').agg({'col1': 'mean'})
summarise(gdf, total=sum(col1))               df.groupby('col1').sum()
Base R
R makes it easy to access data.frame columns by name.
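For comparison, here is a short sketch of selecting several columns by name in pandas, roughly what df[, c("a", "c", "e")] does in R. The frame and the column names a to e are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the column names a..e are placeholders.
df = pd.DataFrame(np.random.randn(4, 5), columns=list("abcde"))

# Select several columns by name with a list of labels.
by_name = df[["a", "c", "e"]]
same_by_name = df.loc[:, ["a", "c", "e"]]

# Select columns by integer position; np.r_ joins the position
# ranges 0:2 and 3:5 into a single array of indices.
by_position = df.iloc[:, np.r_[0:2, 3:5]]
```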
A common way to select data in R is using %in%, which is defined using the function match. The operator %in% returns a logical vector indicating whether there is a match:
s <- 0:4
s %in% c(2,4)
The pandas equivalent (after import numpy as np and import pandas as pd):
In [12]: s = pd.Series(np.arange(5), dtype=np.float32)
In [13]: s.isin([2, 4])
Out[13]:
0 False
1 False
2 True
3 False
4 True
dtype: bool
The match function returns a vector of the positions of matches of its first argument in its second:
s <- 0:4
match(s, c(2,4))
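pandas has no function named match, but Index.get_indexer can play a similar role. The sketch below is one possible translation and not the only one. Positions come back zero-based, and -1 marks a missing match where R would return NA.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4])
lookup = pd.Index([2, 4])

# Position of each element of s within lookup.
# Zero-based; -1 means "no match" (R is one-based and uses NA).
positions = lookup.get_indexer(s.to_numpy())

# Optional: convert to R-style output (one-based, NaN for no match).
r_like = pd.Series(positions).where(positions >= 0) + 1
```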
tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular. Using a data.frame called baseball, and retrieving information based on the array team:
In [14]: import random
In [15]: import string
In [16]: baseball = pd.DataFrame(
....: {
....: "team": ["team %d" % (x + 1) for x in range(5)] * 5,
....: "player": random.sample(list(string.ascii_lowercase), 25),
....: "batting avg": np.random.uniform(0.200, 0.400, 25),
....: }
....: )
....:
In [17]: baseball.pivot_table(values="batting avg", columns="team", aggfunc=np.max)
Out[17]:
team team 1 team 2 team 3 team 4 team 5
batting avg 0.352134 0.295327 0.397191 0.394457 0.396194
df <- data.frame(a=rnorm(10), b=rnorm(10))
subset(df, a <= b)
df[df$a <= df$b,] # note the comma
The pandas equivalent:
In [18]: df = pd.DataFrame({"a": np.random.randn(10), "b": np.random.randn(10)})
In [19]: df.query("a <= b")
Out[19]:
a b
1 0.174950 0.552887
2 -0.023167 0.148084
3 -0.495291 -0.300218
4 -0.860736 0.197378
5 -1.134146 1.720780
7 -0.290098 0.083515
8 0.238636 0.946550
In [20]: df[df["a"] <= df["b"]]
Out[20]:
a b
1 0.174950 0.552887
2 -0.023167 0.148084
3 -0.495291 -0.300218
4 -0.860736 0.197378
5 -1.134146 1.720780
7 -0.290098 0.083515
8 0.238636 0.946550
In [21]: df.loc[df["a"] <= df["b"]]
Out[21]:
a b
1 0.174950 0.552887
2 -0.023167 0.148084
3 -0.495291 -0.300218
4 -0.860736 0.197378
5 -1.134146 1.720780
7 -0.290098 0.083515
8 0.238636 0.946550
An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:
df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)
df$a + df$b # same as the previous expression
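In pandas, DataFrame.eval evaluates an expression with column names resolved inside the frame, which is close in spirit to with. A minimal sketch, using randomly generated data as in the R snippet:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.randn(10), "b": np.random.randn(10)})

# Evaluate the expression with column names looked up in df,
# similar to with(df, a + b).
total = df.eval("a + b")

# Same result using explicit column access, similar to df$a + df$b.
same_total = df["a"] + df["b"]
```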
plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data structures in R, a for arrays, l for lists, and d for data.frame. The table below shows how these data structures could be mapped in Python.
R             Python
array         list
lists         dictionary or list of objects
data.frame    dataframe
ddply
An expression using a data.frame called df in R where you want to summarize x by month and week:
require(plyr)
df <- data.frame(
x = runif(120, 1, 168),
y = runif(120, 7, 334),
z = runif(120, 1.7, 20.7),
month = rep(c(5,6,7,8),30),
week = sample(1:4, 120, TRUE)
)
ddply(df, .(month, week), summarize,
mean = round(mean(x), 2),
sd = round(sd(x), 2))
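A pandas counterpart can be sketched with groupby followed by agg. The data below is generated to mirror the R frame; np.random.randint stands in for sample(1:4, 120, TRUE), and like R's sd, pandas' std is the sample standard deviation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "x": np.random.uniform(1.0, 168.0, 120),
        "y": np.random.uniform(7.0, 334.0, 120),
        "z": np.random.uniform(1.7, 20.7, 120),
        "month": [5, 6, 7, 8] * 30,
        # Upper bound is exclusive, so this draws weeks 1..4.
        "week": np.random.randint(1, 5, 120),
    }
)

# Split by (month, week), then compute the mean and standard
# deviation of x within each group, rounded to two decimals.
summary = df.groupby(["month", "week"])["x"].agg(["mean", "std"]).round(2)
```

The result is indexed by (month, week); calling summary.reset_index() turns the group keys back into ordinary columns, which is closer to the shape ddply returns.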
In [32]: cheese = pd.DataFrame(
....: {
....: "first": ["John", "Mary"],
....: "last": ["Doe", "Bo"],
....: "height": [5.5, 6.0],
....: "weight": [130, 150],
....: }
....: )
....:
In [33]: pd.melt(cheese, id_vars=["first", "last"])
Out[33]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
In [34]: cheese.set_index(["first", "last"]).stack() # alternative way
Out[34]:
first last
John Doe height 5.5
weight 130.0
Mary Bo height 6.0
weight 150.0
dtype: float64
cast
In R, acast casts a data.frame called df into a higher-dimensional array.
In [40]: df.groupby(["Animal", "FeedType"])["Amount"].sum()
Out[40]:
Animal FeedType
Animal1 A 10
B 5
Animal2 A 2
B 13
Animal3 A 6
Name: Amount, dtype: int64
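The frame df behind the output above is not shown here. The sketch below builds a frame whose group totals match that output (the individual rows are an assumption) and compares the groupby result with a pivot_table call, which spreads FeedType across columns instead of keeping it in the index.

```python
import pandas as pd

# Assumed rows, chosen so the group totals match the output shown above.
df = pd.DataFrame(
    {
        "Animal": ["Animal1", "Animal2", "Animal3", "Animal2",
                   "Animal1", "Animal2", "Animal3"],
        "FeedType": ["A", "B", "A", "A", "B", "B", "A"],
        "Amount": [10, 7, 4, 2, 5, 6, 2],
    }
)

# Long result: one row per (Animal, FeedType) pair.
totals = df.groupby(["Animal", "FeedType"])["Amount"].sum()

# Wide result: one row per Animal, one column per FeedType.
# Pairs with no data (Animal3, B) come out as NaN.
wide = df.pivot_table(
    values="Amount", index="Animal", columns="FeedType", aggfunc="sum"
)
```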
pandas has a data type for categorical data.
cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))
In pandas this is accomplished with pd.cut and astype("category"):
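A minimal sketch of both calls, reusing the values from the R snippet:

```python
import pandas as pd

# Bin six values into three equal-width intervals,
# similar to cut(c(1,2,3,4,5,6), 3).
binned = pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)

# Convert a series to a categorical dtype,
# similar to factor(c(1,2,3,2,2,3)).
as_category = pd.Series([1, 2, 3, 2, 2, 3]).astype("category")
```

Both results carry the category dtype, so the .cat accessor (for example binned.cat.categories) is available on each.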
Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for, this page was started to provide a more detailed look at the R language and its many third-party libraries as they relate to pandas. In comparisons with R and CRAN libraries, we care about functionality and flexibility, performance, and ease of use.
For transfer of DataFrame objects from pandas to R, one option is to use HDF5 files.
We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas equivalents.
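For select(df, -(col1:col3)), the pandas counterpart is df.drop(cols_to_drop, axis=1), where cols_to_drop is a list of the column names to remove.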
Slicing with R’s c: in pandas, select several columns by name with a list of labels, for example df[['a', 'c']] or df.loc[:, ['a', 'c']].
The groupby() method is similar to the base R aggregate function.
The isin() method is similar to R's %in% operator.
For tapply, pandas can use the pivot_table() method.
The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column’s values are less than another column’s values.
In pandas, there are a few ways to perform subsetting. You can use query(), pass an expression as if it were an index/slice, or use standard boolean indexing.
For with, the equivalent pandas expression uses the eval() method. In certain cases eval() will be much faster than evaluation in pure Python.
For ddply, the equivalent pandas expression uses the groupby() method.
When R's melt is applied to a list, the Python counterpart would be a list of tuples, so the DataFrame() constructor would convert it to a dataframe as required.
In Python, the melt() function is the equivalent of R's melt for a data.frame.
For acast, the best way in Python is to make use of pivot_table().
For dcast, Python can approach this in two different ways: first, similar to the above, using pivot_table(); second, using the groupby() method.