Category "dataframe"

How to make a dataframe faster: by using a dictionary or NumPy?

I am new to data structures and I would like to make my code faster (this is just part of a larger program). Using dataframes while looking up variables is slowing

Pandas: Creating multiple indicator columns after condition with dates

So I have a data set with about 70,000 data points, and I'm trying to test out some code on a sample data set to make sure it will work on the large one. The sa

Left join pandas if column value is within a certain range?

I was wondering if it is possible to merge two datasets if the values are within a certain range of each other. For example, if I want to join on zip codes, then
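
One possible way to approach a range-based left join in pandas is to build every candidate pair with a cross merge, keep only the pairs within the tolerance, and then left-join those matches back onto the original rows. This is only a sketch: the frames `left` and `right`, the column names `zip`, `zip_r`, and `label`, and the `tolerance` value are made up for illustration and do not come from the question. A cross merge grows with the product of the two row counts, so it suits small inputs only.

```python
import pandas as pd

# Hypothetical sample data; names and values are assumptions for illustration.
left = pd.DataFrame({"zip": [10001, 20005, 90210]})
right = pd.DataFrame({"zip_r": [10003, 20050], "label": ["A", "B"]})
tolerance = 5  # assumed maximum allowed difference between zip codes

# Build every left/right pair (requires pandas 1.2 or newer for how="cross").
pairs = left.merge(right, how="cross")

# Keep only the pairs whose zip codes are within the tolerance.
matches = pairs[(pairs["zip"] - pairs["zip_r"]).abs() <= tolerance]

# Left-join the matches back so unmatched left rows are kept with NaN.
result = left.merge(matches, on="zip", how="left")
print(result)
```

For larger sorted data, `pd.merge_asof` with its `tolerance` argument may be a better fit, although it matches only the nearest key rather than every key in range.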

Is it possible to know the size of a variable that is being created while the function is running?

I am very new to R and I was exploring a function in a library that downloads data from a server and leaves the data as a dataframe. The data are stored in a varia

Dataframe in Scala

I am trying to train a model for movie recommendation. I have a dataset which has a list of all the cast members and movie details with descriptions. Based on the occu

What does "100 *" mean in "100 * df.isna().mean()"?

Can anyone explain what is the use of 100 * in the following line of code: 100 * df.isna().mean() Is it intended to get the percentage of the average value?
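
A small sketch may help show what the expression computes. `df.isna()` gives a table of True/False values, `.mean()` averages them per column (True counts as 1), which yields the fraction of missing values in each column, and multiplying by 100 turns that fraction into a percentage. The sample data and the variable names `missing_fraction` and `missing_percent` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" has 2 of 4 values missing, "b" has 1 of 4.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, np.nan],
})

# Mean of the boolean mask = fraction of missing values per column.
missing_fraction = df.isna().mean()

# Scale the fraction (0 to 1) to a percentage (0 to 100).
missing_percent = 100 * missing_fraction
print(missing_percent)
```

So the result is the percentage of missing values per column, not a percentage of the column's average value.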

How to read an Excel file with multiple sheets in Python? I got an error saying 'pandas' has no attribute 'excel'

At first, I wrote: import numpy as np import pandas as pd import glob all_data = pd.DataFrame() for f in glob.glob("*.xlsx"): df = pd.read_excel(f) all_

Checking the normality assumption of a linear mixed effects model

I have the following code for an LME: IDRTlme <- lme(Score ~ Group*Condition, random = ~1|ID, data=IDRT) I want to check the normality assumption, and so I h

Calculate value based on previous value and multiplication

I am trying to do something which is very simple in Excel, but I can't seem to find the way to do it in Python. I want to calculate the next value in a d

Is there an R function to convert 'flowFrame' structure of 'flowCore' package to a 'data.frame'?

Objective: To view .fcs data as a dataframe using R language. Flow Cytometry data comes in .fcs file format. The file is read in the flowFrame structure produce

Vaex copy columns between dataframes

I have a dataframe that I performed a filter on and then added some virtual columns. I wish to add those columns back in with the original data frame. Here is m

Python API Call: JSON to Pandas DF

I'm working on pulling data from a public API and converting the response JSON file to a Pandas Dataframe. I've written the code to pull the data and gotten a s

Adding new dataframe columns using information extracted from the URL in the url column, where the URL could be missing information

Given: A pandas dataframe that contains a user_url column among other columns. Expectation: New columns added to the original dataframe where the columns are co

Pandas groupby feature question for output CSV

I have the following code df.groupby('AccountNumber')[['TotalStake','TotalPayout']].sum() which displays as I would like it to in pandas. The issue is when I ou

Create multiple DataFrames using data from an api

I'm using the world bank API to analyze data and I want to create multiple data frames with the same indicators for different countries. import wbgapi as wb imp

Removing nested variables if there are NAs in certain variables inside the nested variable

I have a dataframe that looks something like this: df <- data.frame(gvkey = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6), date = c(01,02,03,01,02,03,01,02,03,01,0

Is there a way to validate data type lengths in Pandas when using the read_csv function?

I'm trying to put some sort of length validation for columns using Pandas. For example, let's say I have a csv named test.csv that has the following data within

Apply loc to the entire dataframe but one column (keep the one column as it was and not remove it)

I am trying to divide the entire dataframe by a fixed number but I want to keep the 'Year' column as is. I tried dividing the entire df by 100 and then multiply
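
A minimal sketch of one way to do this, assuming all columns other than 'Year' are numeric: select every column except 'Year' and divide only that selection, so 'Year' is never touched. The sample frame and the column names `x` and `y` are hypothetical; only 'Year' and the divisor 100 come from the question.

```python
import pandas as pd

# Hypothetical data; only the "Year" column name is taken from the question.
df = pd.DataFrame({
    "Year": [2020, 2021],
    "x": [100.0, 200.0],
    "y": [50.0, 25.0],
})

# All column labels except "Year".
cols = df.columns.difference(["Year"])

# Divide only the selected columns; "Year" stays unchanged.
df[cols] = df[cols] / 100
print(df)
```

This avoids dividing 'Year' and multiplying it back afterwards, which can introduce floating-point noise and change the column's dtype.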

Pandas - Cross referencing with DatetimeIndex - Groupby

I have data for many companies by month (end of month). I want to create a new column with groupby for each company where: new_col from Jul of this year to Jun