What to do when your dataset is too big for Excel go to Jupiter, Python and Pandas
Well, I’ve had two events over where Excel (or Google Sheets) is just too small and/or not maintainable, so the question is what we can do about it. For instance if you have over a million rows of data (looking at your Brian) or where there is lots of logics (looking at you https://restart.us) There are a couple of choices:
- Port to a proprietary Business Intelligence system. The ones I’ve tried are Tableau and PowerBi. Both are really good, but they do lock you into that vendors way of working.
- Move to a SQL-backed system. Dump the data into a SQL system and make the data small enough to analyze with either Excel or switch to Python. The disadvantage is that you do have to learn SQL which can be a little complicated. And you have to build your customer user interface.
- Move to Python/Jupyter. This is the big switch, now you are becoming a programmer and you have to think about the making that investment. On the other hand, using Jupyter gives you a semi-decent user interface particularly if it is not for lots of people. And there is lots of work going on here with Google Colab and the many Jupyter Notebook hosters that are out there.
So, if you have anyone who can write software or is willing to learn, the last alternative is probably the sweet spot for Python/Jupyter and there is a huge base of developers working on this.
The other nice thing is that Google has done quite a bit of work to make getting started easier. So here is what you need to do:
- Take your data either Excel files or a CSV and move it to Google Drive.
- Then start https://colab.research.google.com which is a free service that gives you a Jupyter environment, this is a virtual machine that is running Jupyter, formerly called iPython, this is a web interface that let’s you merge code and text into a single web page.
- Load the data and it is very easy to slice columns with commands like
csv["Name"]so you can index by column names. Now you can see that you can also index by row numbers as well with
csv["Name"].iloc[[3:4]]would select rows 3 and 4 and column Name.
- And you can slice by values as well so
csv[csv["Value"] > 2]would find all rows where the Value is bigger than 2