Forum Discussion
CharlotteHawkes
May 02, 2023Copper Contributor
Complicated Automation Help - 11.5 million records to process!
Hi, I’ve got a ton of data to process ie. 23 files with 500,000 rows each - the whole public Companies House People with Significant Control database basically. The files are .txt files where all...
- May 02, 2023
Please check if there is confidential data in your post and remove all confidential data.
=MID(A2,SEARCH("company_number",A2)+17,8)
=MID(A2,SEARCH("year",A2)+6,4)
If all the records have the same pattern you can try these formulas which are in cells B2 and C2 in the screenshot. The formulas can be filled down as required.
Patrick2788
May 02, 2023Silver Contributor
This is public data so there's no harm in posting this link:
http://download.companieshouse.gov.uk/en_pscdata.html
This may be a way to get cleaner copies of the files (if these are the ones you're looking for). The site says the files are available in JSON format but I downloaded the smallest one and it's a txt file (PowerQuery can handle both formats). It appears the columns are delimited cleanly.
- CharlotteHawkesMay 02, 2023Copper ContributorHi Patrick, thanks for your response, yes that's where I sourced the data from - unfortunately I can't get the columns to delimit clearly as once delimited some rows have more "fields" of info than others so the columns of data don't line up cleanly. For eg. "year" can be in column P for some rows, column z for other rows etc. I'm ok on excel but not amazing so maybe it's how I'm delimiting? If you can tell me how you did it to format them cleanly that would be amazing.