1.) Which data formats (Syslog, CEF, custom) can Logstash accept, and does it have a mapping mechanism for ingesting that data into a Log Analytics workspace?
Logstash is an event pipeline system. It has many input plugins: https://www.elastic.co/guide/en/logstash/current/input-plugins.html
It can also transform data during parsing, so many data formats can be converted to match CEF output. This VMSS solution was designed specifically to meet a customer need: receive Syslog (CEF format) messages, add GeoIP information, then send them to Log Analytics.
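To make the CEF shape concrete, here is a minimal Python sketch of splitting a CEF message into its seven pipe-delimited header fields and its key=value extension. It is not part of the VMSS solution or of Logstash; the function name `parse_cef` and the field names in the returned dictionary are assumptions made for illustration, and the escape handling is deliberately simplified.

```python
import re

# Names chosen for this sketch; the header order follows the CEF layout:
# CEF:Version|Device Vendor|Device Product|Device Version|Event Class ID|Name|Severity|Extension
HEADER_FIELDS = [
    "version", "device_vendor", "device_product", "device_version",
    "event_class_id", "name", "severity",
]


def parse_cef(message: str) -> dict:
    """Parse a CEF message (optionally preceded by a syslog prefix)."""
    start = message.find("CEF:")
    if start == -1:
        raise ValueError("not a CEF message")
    body = message[start + len("CEF:"):]

    # Split on pipes that are not escaped with a backslash.
    # Simplification: an escaped backslash directly before a pipe is not handled.
    parts = re.split(r"(?<!\\)\|", body, maxsplit=7)
    if len(parts) < 7:
        raise ValueError("CEF header needs 7 fields")

    event = {
        field: value.replace("\\|", "|")
        for field, value in zip(HEADER_FIELDS, parts[:7])
    }

    # Extension: space-separated key=value pairs; values may contain spaces.
    # Simplification: escaped '=' inside values is not handled.
    extension = parts[7] if len(parts) > 7 else ""
    pairs = re.findall(r"(\w+)=(.*?)(?=\s+\w+=|$)", extension)
    event["extensions"] = dict(pairs)
    return event
```

In a Logstash pipeline the same job is done by plugins rather than hand-written code; the sketch only shows which fields a CEF event carries before enrichment such as GeoIP is added.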
2.) Does Logstash have a mapping mechanism to convert data from any data source into CEF or Syslog, which I assume are the formats Sentinel prefers?
Yes, through its filter plugins and codecs. Please review the Logstash documentation.
3.) Do we really need the Log Analytics agent between Logstash and the Sentinel Log Analytics workspace?
For this scenario, yes. CommonSecurityLog (CEF) records come from the agent only. Alternatively, you could change the output to Custom Logs using the Logstash output connector for Log Analytics.
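For context on what "Custom Logs" means at the wire level, the following Python sketch builds (but does not send) a request for the Azure Monitor HTTP Data Collector API, which accepts JSON records and stores them in a table named after the Log-Type header with a _CL suffix. This is an illustration under assumptions, not the Logstash connector's own code: the function name `build_custom_log_request` and its parameters are hypothetical, and the workspace ID and key must come from your own workspace.

```python
import base64
import datetime
import hashlib
import hmac
import json


def build_custom_log_request(workspace_id, shared_key, log_type, records):
    """Return (url, headers, body) for a Data Collector API POST.

    workspace_id / shared_key: values from your Log Analytics workspace.
    log_type: custom log name; records land in the table '<log_type>_CL'.
    records: a list of JSON-serializable dictionaries.
    """
    body = json.dumps(records).encode("utf-8")
    resource = "/api/logs"
    # The API expects an RFC 1123 date in the x-ms-date header.
    now = datetime.datetime.now(datetime.timezone.utc)
    rfc1123_date = now.strftime("%a, %d %b %Y %H:%M:%S GMT")

    # SharedKey signature: HMAC-SHA256 over a fixed string, keyed with the
    # base64-decoded workspace key.
    string_to_sign = (
        f"POST\n{len(body)}\napplication/json\n"
        f"x-ms-date:{rfc1123_date}\n{resource}"
    )
    digest = hmac.new(
        base64.b64decode(shared_key),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    signature = base64.b64encode(digest).decode("ascii")

    url = (
        f"https://{workspace_id}.ods.opinsights.azure.com"
        f"{resource}?api-version=2016-04-01"
    )
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"SharedKey {workspace_id}:{signature}",
        "Log-Type": log_type,
        "x-ms-date": rfc1123_date,
    }
    return url, headers, body
```

The returned triple can be passed to any HTTP client. With this path no agent sits in between, but the data arrives in a custom table rather than in CommonSecurityLog.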
4.) Which data format (Syslog, CEF, or custom) is the best choice for Logstash output when ingesting into the Sentinel Log Analytics workspace?
There is no simple answer. It depends on the data source.
5.) Can logstash-vmss be deployed on-prem?
Logstash can be deployed anywhere, and so can the Log Analytics agent. The VMSS is an Azure-specific resource, so you would need to write your own installer scripts for on-prem.
6.) Could you suggest the best input format for Logstash from any data source (I understand that different data sources may have their own formats) and the best output format from Logstash to the Sentinel Log Analytics workspace?
It depends on your data source. Is it a network appliance such as a firewall? Those normally use CEF. Windows Event Logs should go to Windows Event or Security Event.
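The "it depends on the source" guidance can be written down as a small lookup. The Python sketch below is only an illustration: the source-kind keys and the function name `suggest_route` are made up for this example, and the Syslog row is an added assumption; the table names are the standard Log Analytics tables for each format.

```python
# (output format, destination Log Analytics table) per kind of source.
# Keys are hypothetical labels invented for this sketch.
ROUTES = {
    "firewall": ("CEF", "CommonSecurityLog"),
    "windows_event": ("Windows Event", "Event"),
    "windows_security": ("Security Event", "SecurityEvent"),
    "linux_syslog": ("Syslog", "Syslog"),  # assumption, not from the answer above
}


def suggest_route(source_kind: str, custom_name: str = "MyLogs"):
    """Return (format, table) for a source kind.

    Unknown sources fall back to Custom Logs, whose tables carry a _CL suffix.
    """
    return ROUTES.get(source_kind, ("Custom", f"{custom_name}_CL"))
```

For example, `suggest_route("firewall")` points at CEF and CommonSecurityLog, while an unlisted source falls back to a custom table.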