
[wip] adding a part for pandas.df to hash_value #575


Open · wants to merge 1 commit into main

Conversation

@djarecka (Collaborator) commented Sep 2, 2022

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Summary

Adds a branch to hash_value for pandas.DataFrame inputs: the DataFrame is converted to a dict before the hash value is calculated.
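
A minimal sketch of the approach, assuming hash_value dispatches on the input type; the function body below is illustrative, not the exact patch (the impacted-files list suggests the real hash_value lives in pydra/engine/helpers.py):

```python
# Illustrative sketch only, not the exact patch in this PR.
import hashlib

import pandas as pd


def hash_value(value):
    """Compute a content-based hash for an input value."""
    if isinstance(value, pd.DataFrame):
        # Convert the DataFrame to a plain dict so that two frames with the
        # same contents hash identically, regardless of object identity.
        value = value.to_dict()
    return hashlib.sha256(repr(value).encode()).hexdigest()
```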

@yibeichan - could you please see if this small change fixes some of your issues? It did fix the problem I had with running my workflow within a Jupyter notebook...

Checklist

  • I have added tests to cover my changes (if necessary)
  • I have updated documentation (if necessary)

codecov bot commented Sep 2, 2022

Codecov Report

Base: 77.06% // Head: 77.04% // Decreases project coverage by 0.02% ⚠️

Coverage data is based on head (875fd17) compared to base (219c721).
Patch coverage: 0.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #575      +/-   ##
==========================================
- Coverage   77.06%   77.04%   -0.02%     
==========================================
  Files          20       20              
  Lines        4316     4322       +6     
  Branches     1213     1217       +4     
==========================================
+ Hits         3326     3330       +4     
- Misses        802      803       +1     
- Partials      188      189       +1     
Flag        Coverage Δ
unittests   76.95% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files            Coverage Δ
pydra/engine/helpers.py   79.52% <0.00%> (-0.38%) ⬇️
pydra/engine/core.py      89.04% <0.00%> (+0.06%) ⬆️


☔ View full report at Codecov.

@yibeichan (Collaborator) commented

@djarecka thank you! I'll try it today and let you know.

@yibeichan (Collaborator) commented

@djarecka this change is specific to a list of pandas.DataFrame inputs, correct?
My notebook works fine with or without this change, possibly because I'm using Python 3.7. If the change solves your problem, it should also address potential problems on newer Python versions.
However, when I exported my notebook as a .py script I got errors (not related to your change here):

File "/Users/yibeichen/miniconda3/envs/pydra/lib/python3.7/asyncio/futures.py", line 181, in result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

This error happens after secondlevel_estimation, when the statistical testing starts. Only the .py script hits this error; the notebook works fine.

@tclose (Contributor) commented Apr 2, 2025

I think this should be covered by the sequence hashing functionality, shouldn't it, @effigies?

@effigies (Contributor) commented Apr 2, 2025

Seems like it's worth just testing.

@tclose (Contributor) commented Apr 3, 2025

Doesn't work; different data frames evaluate to the same hash... Looks like we need to do some more work on the backstop object hash.
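
For reference, a minimal reproduction along these lines (the import path pydra.utils.hash.hash_function is an assumption about the current entry point, not confirmed by this thread):

```python
# Hypothetical reproduction of the collision described above; the import
# path below is assumed, and the collision only occurs if the backstop
# object hash ignores the frame's contents.
import pandas as pd

from pydra.utils.hash import hash_function

df_a = pd.DataFrame({"x": [1, 2, 3]})
df_b = pd.DataFrame({"x": [4, 5, 6]})

# If the backstop hash ignores contents, both lines print the same value.
print(hash_function(df_a))
print(hash_function(df_b))
```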

At the moment we are a bit stuck whichever way we err. If the same value produces different hashes, you run the risk of workflows getting stuck halfway through because a downstream node can no longer identify its upstream node's cached result (bad). However, if different values map onto the same hash, you run the risk of producing wrong results (worse). If #784 is implemented we could avoid workflows getting stuck; then we could default to cloudpickle to guarantee that, at least, different values map onto different hashes, and just throw a warning that the cache is likely to be missed in subsequent runs.
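
A hedged sketch of that cloudpickle fallback (backstop_hash and the warning text are illustrative names, not pydra's API):

```python
# Illustrative sketch of the proposed fallback: serialize the object with
# cloudpickle and hash the resulting bytes, warning that cache hits across
# runs are unlikely. Not pydra's actual implementation.
import hashlib
import warnings

import cloudpickle


def backstop_hash(obj) -> str:
    warnings.warn(
        "Falling back to cloudpickle for hashing; the cache is likely "
        "to be missed in subsequent runs."
    )
    # Distinct values serialize to distinct bytes (up to pickle determinism),
    # so at least different values map onto different hashes.
    return hashlib.blake2b(cloudpickle.dumps(obj), digest_size=16).hexdigest()
```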
